Monday, July 30, 2012

awk - 10 examples to read files with multiple delimiters



 In this article of awk series, we will see how to use awk to read or parse text or CSV files containing multiple delimiters or repeating delimiters. Also, we will discuss about some peculiar delimiters and how to handle them using awk.

Let us consider a sample file. This colon separated file contains item, purchase year and a set of prices separated by a semicolon.
$ cat file
Item1:2010:10;20;30
Item2:2012:12;29;19
Item3:2014:15;50;61

1. To print the 3rd column which contains the prices:
$ awk -F: '{print $3}' file
10;20;30
12;29;19
15;50;61
  This is straight forward. By specifying colon(:) in the option with -F, the 3rd column can be retrieved using the $3 variable.

2. To print the 1st component of $3 alone:
$ awk -F '[:;]' '{print $4}' file
20
29
50
    What did we do here? Specified multiple delimiters, one is : and other is ; .  How awk parses the file? Its simple. First, it looks at the delimiters which is colon(:) and semi-colon(;). This means, while reading the line, as and when the delimiter : or ; is encountered,  store the part read in $1. Continue further. Again on encountering one of the delimiters, store the read part in $2. And this continues till the end of the line is reached.  In this way, $4 contained the first part of the price component above.
Note:    Always keep in mind. While specifying multiple delimiters, it has to be specified inside square brackets( [;:] ).

3. To sum the individual components of the 3rd column and print it:
$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file
Item1:2010:60
Item2:2012:60
Item3:2014:126
      The individual components of the price($3) column are available in $3, $4 and $5. Simply, sum them up and store in $3, and print all the variables. OFS (output field separator) is used to specify the delimiter while printing the output.
Note: If we do not use the OFS, awk will print the fields using the default output delimiter which is space.

4. Un-group or re-group every record depending on the price column:
$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file
Item1:2010:10
Item1:2010:20
Item1:2010:30
Item2:2012:12
Item2:2012:29
Item2:2012:19
Item3:2014:15
Item3:2014:50
Item3:2014:61
     The requirement here is:  New records have to be created for every component of the price column. Simply, a loop is run on from columns 3 to 5, and every time a record is framed using the price component.

5-6. Read file in which the delimiter is square brackets:
$ cat file
123;abc[202];124
125;abc[203];124
127;abc[204];124
  5.  To print the value present within the brackets:
$ awk -F '[][]' '{print $2}' file
202
203
204
     At the first sight, the delimiter used in the above command might be confusing. Its simple. 2 delimiters are to be used in this case: One is [ and the other is  ]. Since the delimiters itself is square brackets which is to be placed within the square brackets, it looks tricky at the first instance.

Note: If square brackets are delimiters, it should be put in this way only, meaning first ] followed by [. Using the delimiter like -F '[[]]' will give a different interpretation altogether.

  6.  To print the first value, the value within brackets, and the last value:
$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file
123;202;124
125;203;124
127;204;124
     3 delimiters are used in this case with semi-colon also included.

7-8. Read or parse a file containing a series of delimiters:
$ cat file
123;;;202;;;203
124;;;213;;;203
125;;;222;;;203
      The above file contains a series of 3 semi-colons between every 2 values.

    7. Using the multiple delimiter method:
$ awk -F'[;;;]' '{print $2}' file



    Blank output !!! The above delimiter, though specified as 3 colons is as good as one delimiter which is a semi-colon(;) since they are all the same. Due to this, $2 will be the value between the first and the second semi-colon which in our case is blank and hence no output.

     8. Using the delimiter without square brackets:
$ awk -F';;;' '{print $2}' file
202
213
222
     The expected output !!!  No square brackets is used and we got the output which we wanted.

Difference between using square brackets and not using it : When a set of delimiters are specified using square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' means to separate the contents either on encountering ':' or ';'. However, when a set of delimiters are specified without using square brackets, awk looks at them literally to separate the contents. For example, -F ':;' means to separate the contents only on encountering a colon followed by a semi-colon. Hence, in the last example, the file contents are separated only when a set of 3 continuous semi-colons are encountered.

9. Read or parse a file containing a series of delimiters of varying lengths:
      In the below file, the 1st and 2nd column are separated  using 3 semi-colons, however the 2nd and 3rd are separated by 4 semi-colons
$ cat file
123;;;202;;;;203
124;;;213;;;;203
125;;;222;;;;203
$ awk -F';'+ '{print $2,$3}' file
202 203
213 203
222 203
      The '+' is a regular expression. It indicates one or more of previous characters. ';'+ indicates one or more semi-colons, and hence both the 3 semi-colons and 4 semi-colons get matched.

10.  Using a word as a delimiter:
$ cat file
123Unix203
124Unix203
125Unix203
     Retrieve the numbers before and after the word "Unix" :
$ awk -F'Unix' '{print $1, $2}' file
123 203
124 203
125 203
     In this case, we use the word "Unix" as the delimiter. And hence $1 and $2 contained the appropriate values . Keep in mind, it is not just the special characters which can be used as delimiters. Even alphabets, words can also be used as delimiters.

P.S: We will discuss about the awk split command on how to use it in these types of multiple delimited files.

11 comments:

  1. 9. Read or parse a file containing a series of delimiters of varying lengths

    Exactly what I was looking for!!

    Thanks!

    ReplyDelete
  2. 10. Using a word as a delimiter:

    $ awk -F'Unix' '{print $1, $2}' file

    it is not properly work in solaris 10,

    ReplyDelete
  3. Hello, this is awesome knowledge but i have the curiosity:

    if i have the following case -F "[|;]" and my field $17 delimited by "|"
    has various delimitations by ";" i get only first result of ; sub delimitation how do i get any other value?

    example field $17:

    2013-08-04;2013-08-10;CI6785,;HG=S;BD=S;AD=2,;NI=1,;SA,;TH,;HB=S;RG=S;PO=S;VI=S;

    I want to get third field....

    Thanks in advanced

    ReplyDelete
    Replies
    1. Its not clear. If you show your complete record and tell which part you are trying to extract exactly, it will help.

      Delete
  4. Hi, I have a query. Suppose you have file1 with 3 different columns and file2 with 5 different columns. Now I want a 3rd file (lets say file3) which contains filed 1 and 2 of file1 and fields 3,4,5 of file2. How to to that through awk.

    ReplyDelete
    Replies
    1. $ cat a1
      A,1
      B,3
      $ cat a2
      A,45
      B,88
      $ paste -d, a1 a2 | awk -F, '{print $1,$3,$4}' OFS=,
      A,A,45
      B,B,88


      In future, please post the question in the unix school forum: forum.theunixschool.com

      Delete
  5. Thank you! This is a good introduction to awk.

    ReplyDelete
  6. Thank you.. works like a charm

    ReplyDelete
  7. Hi I have a query need help


    This is string

    I 443:32574/TCP,80:32098/TCP
    Need 443:32482/TCP,80:32711/TCP
    Help 443:31020/TCP,80:31581/TCP
    For 80:32061/TCP
    this 443:31381/TCP,80:32701/TCP
    string 443:31884/TCP,80:30070/TCP
    please 443:30251/TCP,80:31253/TCP
    help 80:31546/TCP,443:32019/TCP

    Want output like below

    first field + port number after 80

    I 32098
    Need 32711
    Help 31581
    For 32061
    this 32701
    string 30070
    please 31253
    help 32019

    ReplyDelete