In this article of awk series, we will see how to use awk to read or parse text or CSV files containing multiple delimiters or repeating delimiters. Also, we will discuss about some peculiar delimiters and how to handle them using awk.
Let us consider a sample file. This colon separated file contains item, purchase year and a set of prices separated by a semicolon.
$ cat file Item1:2010:10;20;30 Item2:2012:12;29;19 Item3:2014:15;50;61
1. To print the 3rd column which contains the prices:
$ awk -F: '{print $3}' file 10;20;30 12;29;19 15;50;61This is straight forward. By specifying colon(:) in the option with -F, the 3rd column can be retrieved using the $3 variable.
2. To print the 1st component of $3 alone:
$ awk -F '[:;]' '{print $4}' file 20 29 50
What did we do here? Specified multiple delimiters, one is : and other is ; . How awk parses the file? Its simple. First, it looks at the delimiters which is colon(:) and semi-colon(;). This means, while reading the line, as and when the delimiter : or ; is encountered, store the part read in $1. Continue further. Again on encountering one of the delimiters, store the read part in $2. And this continues till the end of the line is reached. In this way, $4 contained the first part of the price component above.
Note: Always keep in mind. While specifying multiple delimiters, it has to be specified inside square brackets( [;:] ).
3. To sum the individual components of the 3rd column and print it:
$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file Item1:2010:60 Item2:2012:60 Item3:2014:126
The individual components of the price($3) column are available in $3, $4 and $5. Simply, sum them up and store in $3, and print all the variables. OFS (output field separator) is used to specify the delimiter while printing the output.
Note: If we do not use the OFS, awk will print the fields using the default output delimiter which is space.
4. Un-group or re-group every record depending on the price column:
$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file Item1:2010:10 Item1:2010:20 Item1:2010:30 Item2:2012:12 Item2:2012:29 Item2:2012:19 Item3:2014:15 Item3:2014:50 Item3:2014:61The requirement here is: New records have to be created for every component of the price column. Simply, a loop is run on from columns 3 to 5, and every time a record is framed using the price component.
5-6. Read file in which the delimiter is square brackets:
$ cat file 123;abc[202];124 125;abc[203];124 127;abc[204];1245. To print the value present within the brackets:
$ awk -F '[][]' '{print $2}' file 202 203 204
At the first sight, the delimiter used in the above command might be confusing. Its simple. 2 delimiters are to be used in this case: One is [ and the other is ]. Since the delimiters itself is square brackets which is to be placed within the square brackets, it looks tricky at the first instance.
Note: If square brackets are delimiters, it should be put in this way only, meaning first ] followed by [. Using the delimiter like -F '[[]]' will give a different interpretation altogether.
6. To print the first value, the value within brackets, and the last value:
$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file 123;202;124 125;203;124 127;204;1243 delimiters are used in this case with semi-colon also included.
7-8. Read or parse a file containing a series of delimiters:
$ cat file 123;;;202;;;203 124;;;213;;;203 125;;;222;;;203The above file contains a series of 3 semi-colons between every 2 values.
7. Using the multiple delimiter method:
$ awk -F'[;;;]' '{print $2}' file
Blank output !!! The above delimiter, though specified as 3 colons is as good as one delimiter which is a semi-colon(;) since they are all the same. Due to this, $2 will be the value between the first and the second semi-colon which in our case is blank and hence no output.
8. Using the delimiter without square brackets:
$ awk -F';;;' '{print $2}' file 202 213 222The expected output !!! No square brackets is used and we got the output which we wanted.
Difference between using square brackets and not using it : When a set of delimiters are specified using square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' means to separate the contents either on encountering ':' or ';'. However, when a set of delimiters are specified without using square brackets, awk looks at them literally to separate the contents. For example, -F ':;' means to separate the contents only on encountering a colon followed by a semi-colon. Hence, in the last example, the file contents are separated only when a set of 3 continuous semi-colons are encountered.
9. Read or parse a file containing a series of delimiters of varying lengths:
In the below file, the 1st and 2nd column are separated using 3 semi-colons, however the 2nd and 3rd are separated by 4 semi-colons
$ cat file 123;;;202;;;;203 124;;;213;;;;203 125;;;222;;;;203
$ awk -F';'+ '{print $2,$3}' file 202 203 213 203 222 203
The '+' is a regular expression. It indicates one or more of previous characters. ';'+ indicates one or more semi-colons, and hence both the 3 semi-colons and 4 semi-colons get matched.
10. Using a word as a delimiter:
$ cat file 123Unix203 124Unix203 125Unix203Retrieve the numbers before and after the word "Unix" :
$ awk -F'Unix' '{print $1, $2}' file 123 203 124 203 125 203
In this case, we use the word "Unix" as the delimiter. And hence $1 and $2 contained the appropriate values . Keep in mind, it is not just the special characters which can be used as delimiters. Even alphabets, words can also be used as delimiters.
P.S: We will discuss about the awk split command on how to use it in these types of multiple delimited files.
9. Read or parse a file containing a series of delimiters of varying lengths
ReplyDeleteExactly what I was looking for!!
Thanks!
10. Using a word as a delimiter:
ReplyDelete$ awk -F'Unix' '{print $1, $2}' file
it is not properly work in solaris 10,
In solaris, try using nawk instead of awk.
DeleteHello, this is awesome knowledge but i have the curiosity:
ReplyDeleteif i have the following case -F "[|;]" and my field $17 delimited by "|"
has various delimitations by ";" i get only first result of ; sub delimitation how do i get any other value?
example field $17:
2013-08-04;2013-08-10;CI6785,;HG=S;BD=S;AD=2,;NI=1,;SA,;TH,;HB=S;RG=S;PO=S;VI=S;
I want to get third field....
Thanks in advanced
Its not clear. If you show your complete record and tell which part you are trying to extract exactly, it will help.
DeleteHi, I have a query. Suppose you have file1 with 3 different columns and file2 with 5 different columns. Now I want a 3rd file (lets say file3) which contains filed 1 and 2 of file1 and fields 3,4,5 of file2. How to to that through awk.
ReplyDelete$ cat a1
DeleteA,1
B,3
$ cat a2
A,45
B,88
$ paste -d, a1 a2 | awk -F, '{print $1,$3,$4}' OFS=,
A,A,45
B,B,88
In future, please post the question in the unix school forum: forum.theunixschool.com
Thank you! This is a good introduction to awk.
ReplyDeleteExcellent article.
ReplyDeleteThank you.. works like a charm
ReplyDeleteHi I have a query need help
ReplyDeleteThis is string
I 443:32574/TCP,80:32098/TCP
Need 443:32482/TCP,80:32711/TCP
Help 443:31020/TCP,80:31581/TCP
For 80:32061/TCP
this 443:31381/TCP,80:32701/TCP
string 443:31884/TCP,80:30070/TCP
please 443:30251/TCP,80:31253/TCP
help 80:31546/TCP,443:32019/TCP
Want output like below
first field + port number after 80
I 32098
Need 32711
Help 31581
For 32061
this 32701
string 30070
please 31253
help 32019