Monday, May 14, 2012

awk - Match a pattern in a file in Linux



  In an earlier article in this awk series, we saw the basic usage of awk (or gawk). In this article, we will see how to search for a pattern in a file using awk, either in the entire line or in a specific column.

  Let us consider a CSV file with the following contents. The data is a simple expense report: each record has a category and an amount. Let us see how to use awk to filter records from the file.
$ cat file
Medicine,200
Grocery,500
Rent,900
Grocery,800
Medicine,600
1. To print only the records containing Rent:
$ awk '$0 ~ /Rent/{print}' file
Rent,900
     ~ is the operator used for pattern matching. The / / symbols delimit the pattern, which is a regular expression. The above command means: if the line ($0) matches (~) the pattern Rent, print the line. The print statement without arguments prints the entire line. This is, in effect, a simulation of the grep command using awk.
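
The pattern on the right-hand side of ~ does not have to be written between slashes. As a small sketch (the variable name pat and the temporary file are illustrative choices, not part of the original example), a string passed in with -v can be used instead, and awk then treats its value as the regular expression:

```shell
# Recreate the sample expense file in a temporary location (illustrative path).
file=$(mktemp)
printf '%s\n' 'Medicine,200' 'Grocery,500' 'Rent,900' 'Grocery,800' 'Medicine,600' > "$file"

# pat is a hypothetical variable name; -v assigns it before the awk program runs.
pat="Rent"
by_awk=$(awk -v pat="$pat" '$0 ~ pat {print}' "$file")

# The equivalent grep call, for comparison.
by_grep=$(grep "$pat" "$file")

echo "$by_awk"
echo "$by_grep"

rm -f "$file"
```

Both commands print Rent,900 for this file. Passing the pattern as a variable is handy when it comes from a script argument rather than being fixed in the awk program.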

2. By default, awk matches a pattern against the entire line, and hence $0 can be left off as shown below:
$ awk '/Rent/{print}' file
Rent,900
3. Since awk prints the line by default when the condition is true, the print statement can also be left off:
$ awk '/Rent/' file
Rent,900
 In this example, whenever the line contains Rent, the condition becomes true and the line gets printed.

4. In the above examples, the pattern is matched against the entire line; however, the pattern we are looking for occurs only in the first column. This might lead to incorrect results if the file contains the word Rent in other places. To match a pattern only in the first column ($1):
$ awk -F, '$1 ~ /Rent/' file
Rent,900
      The -F option in awk is used to specify the field delimiter. It is needed here because awk splits fields on whitespace by default, and the columns of a CSV file can be retrieved only when the comma is given as the delimiter.
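
To make the difference between whole-line and first-column matching concrete, here is a minimal sketch. It assumes a variant of the expense file with a hypothetical third "notes" column, which the original file does not have, so that the word Rent can appear outside the first column:

```shell
# Hypothetical three-column variant of the expense file (category,amount,notes).
file=$(mktemp)
printf '%s\n' 'Medicine,200,pharmacy' 'Rent,900,May' 'Grocery,500,paid after Rent' > "$file"

# Whole-line match: also picks up the Grocery record because of its notes column.
whole_line=$(awk '/Rent/' "$file")

# First-column match: only the record whose category contains Rent.
first_col=$(awk -F, '$1 ~ /Rent/' "$file")

echo "whole line:"
echo "$whole_line"
echo "first column:"
echo "$first_col"

rm -f "$file"
```

The whole-line match prints both the Rent and the Grocery records, while the first-column match prints only Rent,900,May.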

5. The above pattern will also match if the first column contains "Rents". To match exactly the word "Rent" in the first column:
$ awk -F, '$1=="Rent"' file
Rent,900
6. To print only the 2nd column for all "Medicine" records:
$ awk -F, '$1 == "Medicine"{print $2}' file
200
600
7. To match either of the patterns "Rent" or "Medicine" in the file:
$ awk '/Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
8. Similarly, to match the above pattern only in the first column:
$ awk -F, '$1 ~ /Rent|Medicine/' file
Medicine,200
Rent,900
Medicine,600
9. What if the first column contains the word "Medicines"? The above example will match it as well. To match exactly Rent or Medicine and nothing else:
$ awk -F, '$1 ~ /^Rent$|^Medicine$/' file
Medicine,200
Rent,900
Medicine,600
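    The ^ symbol anchors the match to the beginning of the string being tested, and $ anchors it to the end. Since the match here is against $1, that string is the first column rather than the whole line. ^Rent$ matches only when the first column is exactly Rent, and the same holds for Medicine.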
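
The sample file has no "Rents" or "Medicines" records, so the effect of the anchors is not visible in the output above. The following sketch adds two such records (made up for illustration) and compares the unanchored and anchored patterns. It also writes the anchored pattern with a group, ^(Rent|Medicine)$, which is an equivalent and slightly shorter form:

```shell
# Sample data extended with two made-up near-miss categories.
file=$(mktemp)
printf '%s\n' 'Medicine,200' 'Medicines,50' 'Rent,900' 'Rents,75' 'Grocery,500' > "$file"

# Unanchored: matches any first column that merely contains Rent or Medicine.
loose=$(awk -F, '$1 ~ /Rent|Medicine/' "$file")

# Anchored with a group: the first column must be exactly Rent or Medicine.
exact=$(awk -F, '$1 ~ /^(Rent|Medicine)$/' "$file")

echo "unanchored:"
echo "$loose"
echo "anchored:"
echo "$exact"

rm -f "$file"
```

The unanchored pattern prints four records, including Medicines,50 and Rents,75, while the anchored one prints only Medicine,200 and Rent,900.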

10. To print the lines which do not contain the pattern Medicine:
$ awk '!/Medicine/' file
Grocery,500
Rent,900
Grocery,800
    The ! is used to negate the pattern search.

11. To negate the pattern match on the first column alone:
$ awk -F, '$1 !~ /Medicine/' file
Grocery,500
Rent,900
Grocery,800
12. To print all records whose amount is greater than 500:
$ awk -F, '$2>500' file
Rent,900
Grocery,800
Medicine,600
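
One thing to watch with numeric conditions is a header row. The sketch below assumes the file starts with a header line, Category,Amount, which the original sample does not have. Because "Amount" does not look like a number, awk compares it with 500 as a string, and the header passes the test. Two possible workarounds are shown: skipping the first record with NR > 1, or forcing a numeric comparison with $2+0.

```shell
# Sample data with a hypothetical header row added.
file=$(mktemp)
printf '%s\n' 'Category,Amount' 'Medicine,200' 'Grocery,500' 'Rent,900' 'Grocery,800' 'Medicine,600' > "$file"

# Naive test: the header slips through because "Amount" is compared as a string.
naive=$(awk -F, '$2 > 500' "$file")

# Workaround 1: skip the first record explicitly.
skip_header=$(awk -F, 'NR > 1 && $2 > 500' "$file")

# Workaround 2: adding 0 forces a numeric value ("Amount" becomes 0).
coerced=$(awk -F, '$2+0 > 500' "$file")

echo "naive:"
echo "$naive"
echo "skipping the header:"
echo "$skip_header"

rm -f "$file"
```

Both workarounds print the same three records as in the example above: Rent,900, Grocery,800 and Medicine,600.
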
13. To print the Medicine record only if it is the 1st record:
$ awk 'NR==1 && /Medicine/' file
Medicine,200
    This is how the logical AND (&&) condition is used in awk. The record is printed only if it is the first record (NR==1) and it also contains Medicine.

14. To print all those Medicine records whose amount is greater than 500:
$ awk -F, '/Medicine/ && $2>500' file
Medicine,600
15. To print all the Medicine records and also those records whose amount is greater than 600:
$ awk -F, '/Medicine/ || $2>600' file
Medicine,200
Rent,900
Grocery,800
Medicine,600
    This is how the logical OR (||) condition is used in awk.
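
When && and || are mixed in one condition, && binds more tightly than ||, so parentheses matter. As a sketch using the same sample data (the specific condition is only an illustration), the following prints Medicine or Grocery records whose amount is greater than 500, and then shows what happens when the parentheses are left out:

```shell
# Recreate the sample expense file in a temporary location (illustrative path).
file=$(mktemp)
printf '%s\n' 'Medicine,200' 'Grocery,500' 'Rent,900' 'Grocery,800' 'Medicine,600' > "$file"

# Grouped: (Medicine OR Grocery) AND amount > 500.
grouped=$(awk -F, '($1 == "Medicine" || $1 == "Grocery") && $2 > 500' "$file")

# Ungrouped: parsed as Medicine OR (Grocery AND amount > 500).
ungrouped=$(awk -F, '$1 == "Medicine" || $1 == "Grocery" && $2 > 500' "$file")

echo "grouped:"
echo "$grouped"
echo "ungrouped:"
echo "$ungrouped"

rm -f "$file"
```

The grouped condition prints Grocery,800 and Medicine,600. The ungrouped one additionally prints Medicine,200, because every Medicine record satisfies the left side of the || regardless of its amount.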

2 comments:

  1. Thanks, that helped me! Are regex matching against fields and complex boolean patterns allowed in POSIX awk?

  2. In the example above for the expenses:
    > Medicine,300
    Grocery,800
    Rent,900

    When I try to grep Rent using following commands, the behavior is different:
    > awk -F"," '$1 ~ /Rent/' expenses
    o/p - Rent,900

    > awk -F"," '$1 ~ /^Rent$/' expenses
    No o/p

    > awk '/^Rent$/' expenses
    No o/p

    > awk '/Rent/' expenses
    o/p - Rent,900
