The UNIX School: awk - Read a file and split the contents

Tuesday, May 24, 2011

awk - Read a file and split the contents

awk is one of the most powerful utilities used in the unix world. Whenever it comes to text parsing, sed and awk do some unbelievable things. In this first article on awk, we will see the basic usage of awk.

The syntax of awk is:

awk 'pattern{action}' file

where the pattern indicates the pattern or the condition on which the action is to be executed for every line matching the pattern. In case of a pattern not being present, the action will be executed for every line of the file. In case of the action part not being present, the default action of printing the line will be done. Let us see some examples:

Assume a file, say file1, with the following content:

$ cat file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration

This file has 2 fields in it. The first field indicates the name of a person, and the second field denoting their expertise, the first line being the header record.

1. To print only the names present in the file:

$ awk '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru

The above awk command does not have any pattern or condition. Hence, the action will be executed on every line of the file. The action statement reads "print $1". awk, while reading a file, splits the different columns into $1, $2, $3 and so on. And hence the first column is accessible using $1, second using $2, etc. And hence the above command prints all the names which happens to be first column in the file.

2. Similarly, to print the second column of the file:

$ awk '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration

3. In the first example, the list of names got printed along with the header record. How to omit the header record and get only the names printed?

$ awk 'NR!=1{print $1}' file1
Deepak
Neha
Vijay
Guru

The above awk command uses a special variable NR. NR denotes line number ranging from 1 to the actual line count. The conditon 'NR!=1' indicates not to execute the action part for the first line of the file, and hence the header record gets skipped.

4. How do we print the entire file contents?

$ awk '{print $0}' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration

$0 stands for the entire line. And hence when we do "print $0", the whole line gets printed.

5. How do we get the entire file content printed in other way?

$ awk '1' file1
Name Domain
Deepak Banking
Neha Telecom
Vijay Finance
Guru Migration

The above awk command has only the pattern or condition part, no action part. The '1' in the pattern indicates "true" which means true for every line. As said above, no action part denotes just to print which is the default when no action statement is given, and hence the entire file contents get printed.

Let us now consider a file with a delimiter. The delimiter used here is a comma. The comma separated file is called csv file. Assuming the file contents to be:

$ cat file1
Name,Domain,Expertise
Deepak,Banking,MQ Series
Neha,Telecom,Power Builder
Vijay,Finance,CRM Expert
Guru,Migration,Unix

This file contains 3 fields. The new field being the expertise of the respective person.

6. Let us try to print the first column of this csv file using the same method as mentioned in Point 1.

$ awk '{print $1}' file1
Name,Domain,Expertise
Deepak,Banking,MQ
Neha,Telecom,Power
Vijay,Finance,CRM
Guru,Migration,Unix

The output looks weird. Isnt it? We expected only the first column to get printed, but it printed little more and that too not a definitive one. If you notice carefully, it printed every line till the first space is encountered. awk, by default, uses the white space as the delimiter which could be a single space, tab space or a series of spaces. And hence our original file was split into fields depending on space.

Since our requirement now involves dealing with a file which is comma separated, we need to specify the delimiter.

$ awk -F","  '{print $1}' file1
Name
Deepak
Neha
Vijay
Guru

awk has a command line option "-F' with which we can specify the delimiter. Once the delimiter is specified, awk splits the file on the basis of the delimiter specified, and hence we got the names by printing the first column $1.

7. awk has a special variable called "FS" which stands for field separator. In place of the command line option "-F', we can also use the "FS".

$ awk  '{print $1,$3}' FS="," file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix

8. Similarly, to print the second column:

$ awk -F, '{print $2}' file1
Domain
Banking
Telecom
Finance
Migration

9. To print the first and third columns, ie., the name and the expertise:

$ awk -F"," '{print $1, $3}' file1
Name Expertise
Deepak MQ Series
Neha Power Builder
Vijay CRM Expert
Guru Unix

10. The output shown above is not easily readable since the third column has more than one word. It would have been better had the fields being displayed are present with a delimiter. Say, lets use comma to separate the output. Also, lets discard the header record.

$ awk -F"," 'NR!=1{print $1,$3}' OFS="," file1
Deepak,MQ Series
Neha,Power Builder
Vijay,CRM Expert
Guru,Unix

OFS is another awk special variable. Just like how FS is used to separate the input fields, OFS (Output field separator) is used to separate the output fields.

24 comments:

AGNRAugust 28, 2013 at 10:19 AM
Thanks..it was usefull
ReplyDelete
Replies
LindylexOctober 2, 2013 at 11:17 PM
Two hours later and this saved my life extracting the second column of this comma separated file (CSV) myfile.txt

Country,City,AccentCity,Region,Population,Latitude,Longitude
ad,aixas,Aixàs,06,,42.4833333,1.4666667
ad,aixirivali,Aixirivali,06,,42.4666667,1.5
ad,aixirivall,Aixirivall,06,,42.4666667,1.5

awk -F"," 'NR!=1{print $2}' myfile.txt

Thanks, Lindylex
ReplyDelete
Replies
LindylexOctober 2, 2013 at 11:19 PM
Thanks I used this to extract the 2nd column in my CSV comma separate file myfile.txt
-------
myfile.txt

Country,City,AccentCity,Region,Population,Latitude,Longitude
ad,aixas,Aixàs,06,,42.4833333,1.4666667
ad,aixirivali,Aixirivali,06,,42.4666667,1.5
ad,aixirivall,Aixirivall,06,,42.4666667,1.5

-------
awk -F"," 'NR!=1{print $2}' myfile.txt
ReplyDelete
Replies
AnonymousApril 24, 2014 at 2:08 PM
Hi! Thanks for these usefull little tips my friend :)
ReplyDelete
Replies
Amrut & DeepaliApril 24, 2014 at 5:32 PM
Good Post !! very helpful
ReplyDelete
Replies
UnknownJuly 17, 2014 at 1:53 AM
can you do the following:
awk '{print $$index}' file
where index is a variable that holds a number?
so if index =1, it would be the same as:
awk '{print $1}' file
ReplyDelete
Replies
Andre SilveiraJuly 26, 2014 at 9:45 PM
Wow awk is so cool! Thanks for the article, it's useful and really easy to understand.
ReplyDelete
Replies
Sathish SubramaniamMarch 6, 2015 at 10:31 PM
Awesome
ReplyDelete
Replies
UnknownOctober 27, 2015 at 10:15 AM
i/p:
Line1
Line2
Line3
Line4
Line5
o/p:
Line1
Line2
Line3
Line4
Line5
I have an input as mentioned above. Line1,2,3 etc is a whole line of different words e.g create table tablename ...... The 2nd scenario is also valid where multiple lines will be between two start and end tags. Is there a way to extract this using awk ?

--Samir
ReplyDelete
Replies
UnknownDecember 12, 2015 at 1:47 AM
thank you for this good explanation
ReplyDelete
Replies
PrabhaJanuary 6, 2016 at 11:36 AM
I have a file containing following format:-

file.text:-
userread=ruser1,rusers2,ruser3
userwrite=wuser1,wuser2,wuser3

I need to list all users only in a different file.
can you please tell me to extract usernames from this file.txt.
ReplyDelete
Replies
UnknownApril 7, 2016 at 4:14 PM
Hi,

Please anyone solve my issue.

I have a file with multiple lines. I want to create a file from input file which contains lines starting from Begin to Line starting from End.

Example:

Begin:123 Its is = 01c
St:13e is =44c

End:234 sd=5C

Begin :sfvf= 7C
str:98 +c

End:hs =7C

now if i consider above as input file, I want 2 files as out which spilts above file from Begin to End
ReplyDelete
Replies
UnknownApril 16, 2016 at 7:28 PM
Hi,

Is 7th and 9th examples are correct ??
ReplyDelete
Replies
UnknownDecember 19, 2016 at 10:18 AM
Thank you very much for the detailed examples with explanations.

ReplyDelete
Replies
UnknownMarch 30, 2019 at 6:40 PM
I'm getting an issue where the output file has only one line from the whole input file primtep and that too ending with a colon
ReplyDelete
Replies
UnknownApril 16, 2019 at 4:59 PM
Hi All,

I have data like this

a,b,c,d,e
h,i,j,k,l
m,o,p,b,y

i wish to double quote each column and from 3rd column to till end it should appear as one column

"a","b","c,d,e"
"h","i","j,k,l
"m","o","p,b"

How to do it.
ReplyDelete
Replies
UnknownAugust 28, 2019 at 8:38 PM
Very good.
I'd just say that " The conditon 'NR!=1' indicates not to execute the action part for the first line of the file,"
is true, but really what it's doing is executing a test; if the line number != 1, then the test returns true, and what follows the test is executed. So in reality what it's doing is printing the line as long as it's not line number 1, rather than not printing line number 1.
ReplyDelete
Replies

Add comment

Pages

Tuesday, May 24, 2011

awk - Read a file and split the contents

24 comments: