Monday, May 28, 2012

Shell - Read a text or CSV file and extract data



   In one of our earlier articles on awk, we saw how easily awk can parse a file and extract data from it. The shell can also handle text files whose fields are separated by white space, as well as CSV files whose fields are separated by commas. In this article, let us see how to read and parse a file line by line and field by field, and extract data from it, using the shell's while loop together with the read command. Along the way, the article also explains how the while loop is used.

Let us consider a file with sample contents as shown below. The file contains 3 fields: OS, the company and a random value.
$ cat file
Solaris Sun 25
Linux RedHat 30
1. The while loop, combined with the read command, can read a file and parse it line by line. We discussed this once in the 15 different ways to display the file contents. A simple example demonstrating it:
$ cat test.sh
#!/bin/bash
while read line
do
        echo "Line is : $line"
done < file
  On running the above script:
$ ./test.sh
Line is : Solaris Sun 25
Line is : Linux RedHat 30
   What the script does: it reads the file named "file" through input redirection (<). Each line read is stored in the variable line, and the body of the while loop prints that variable. When there are no more lines to read, the read command returns false and the while loop ends.

Note: The read command reads up to the next newline character, which is why the while loop processes the file one line at a time.
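The plain "read line" form has a few side effects: it trims leading and trailing white space, treats backslashes as escape characters, and skips a last line that does not end with a newline. A more defensive variant is sketched below. The function name print_lines and the sample input are only illustrative and are not part of the original script.

```shell
#!/bin/bash
# print_lines is a hypothetical helper name used only for this illustration.
print_lines() {
    # IFS= keeps leading/trailing white space; -r keeps backslashes literal.
    # The '|| [ -n "$line" ]' part also processes a final line that has
    # no terminating newline.
    while IFS= read -r line || [ -n "$line" ]
    do
        printf 'Line is : %s\n' "$line"
    done
}

# Sample input: the second line has no terminating newline.
printf 'Solaris Sun 25\nLinux RedHat 30' | print_lines
```

Here both lines are printed even though the input does not end with a newline.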

2.  Read every line into individual fields using the while loop:
$ cat test.sh
#!/bin/bash
while read f1 f2 f3
do
        echo "OS is     : $f1"
        echo "Company is: $f2"
        echo "Value is  : $f3"
done < file
 On running the above script:
$ ./test.sh
OS is     : Solaris
Company is: Sun
Value is  : 25
OS is     : Linux
Company is: RedHat
Value is  : 30
   The read command can take multiple variable names, as shown here with f1, f2 and f3. The first field is read into f1, the second into f2 and so on.
   How does the shell do it? How does it manage to split the fields properly and assign them to the variables? It is because of the special shell variable IFS, which stands for Internal Field Separator. IFS in the shell plays the same role as FS, the field separator, in awk. By default, IFS contains the white space characters space, tab and newline, and a run of consecutive white space characters counts as a single separator. Hence, the shell is able to split the fields appropriately.
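To see the default IFS at work, the following sketch feeds read a line whose fields are separated by several spaces and a tab. The function name split_fields and the input line are made up for this illustration.

```shell
#!/bin/bash
# split_fields is a hypothetical helper name used only for this illustration.
split_fields() {
    while read -r f1 f2 f3
    do
        # Brackets make it visible that no stray white space is left over.
        printf '[%s] [%s] [%s]\n' "$f1" "$f2" "$f3"
    done
}

# Leading spaces, a run of spaces and a tab all act as plain separators.
printf '   Solaris    Sun\t25\n' | split_fields
```

With the default IFS this prints [Solaris] [Sun] [25]: the extra white space does not create empty fields.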

3. What if there are fewer variables than fields? That is, what if the read command is given fewer variable names than there are fields in the file?
$ cat test.sh
#!/bin/bash

while read f1 f2
do

        echo "OS is       : $f1"
        echo "Remaining is: $f2"
done < file
 On running the above script:
$ ./test.sh
OS is       : Solaris
Remaining is: Sun 25
OS is       : Linux
Remaining is: RedHat 30
      What happened? The last variable in the read command consumes all the remaining fields. Similarly, try the reverse case and see what happens when the read command has more variables than there are fields in the file.
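For the reverse case, a minimal sketch is shown here. It assumes the same three-field sample data and uses a made-up function name, show_extra, with one variable more than there are fields.

```shell
#!/bin/bash
# show_extra is a hypothetical helper name used only for this illustration.
show_extra() {
    # Four variables, but each input line has only three fields.
    while read -r f1 f2 f3 f4
    do
        printf 'f3=[%s] f4=[%s]\n' "$f3" "$f4"
    done
}

printf 'Solaris Sun 25\nLinux RedHat 30\n' | show_extra
```

The variable that has no field left to receive, f4 here, is simply set to an empty string.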

4. Change the delimiter of a file from a single space to a colon using the while loop:
#!/bin/bash

while read f1 f2 f3
do
        echo "$f1:$f2:$f3"
done < file > file1
  This script also shows how to redirect the output of a while loop to a file. The file is read in the while loop, and the variables filled by read are printed using the echo statement with a colon between them.
  Also, note this: "< file > file1". The while loop reads from file and writes to file1. The < redirects the input of the loop, and the > redirects its output.

The output file contents will look as below:
$ cat file1
Solaris:Sun:25
Linux:RedHat:30

5. Read every line from a CSV file into individual fields using the while loop. The fields in a CSV file are separated by a comma delimiter:

   Let us assume the sample file to contain data as below:
$ cat file.csv
Solaris,Sun,25
Linux,RedHat,30
The script:
$ cat test.sh
#!/bin/bash
IFS=","
while read f1 f2 f3
do
        echo "OS is        : $f1"
        echo "Company  is  : $f2"
        echo "Value   is   : $f3"
done < file.csv
On running the above script:
$ ./test.sh
OS is        : Solaris
Company  is  : Sun
Value   is   : 25
OS is        : Linux
Company  is  : RedHat
Value   is   : 30
   As expected, the fields are split correctly. Since IFS tells the read command how to split a line into fields, setting IFS to "," makes read split the line at every comma. In this way, we get the individual fields in the corresponding variables.
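One difference from white space splitting is worth noting: with a non-white-space separator such as a comma, two adjacent separators mark an empty field. The sketch below shows this. The function name parse_csv and the sample rows are illustrative, and the IFS assignment is written directly before read so that it applies only to that command. This simple approach does not handle quoted CSV fields that themselves contain commas.

```shell
#!/bin/bash
# parse_csv is a hypothetical helper name used only for this illustration.
parse_csv() {
    # IFS="," placed before read affects only this read command.
    while IFS="," read -r f1 f2 f3
    do
        printf 'OS=[%s] Company=[%s] Value=[%s]\n' "$f1" "$f2" "$f3"
    done
}

# The first row has an empty second field (two commas in a row).
printf 'Solaris,,25\nLinux,RedHat,30\n' | parse_csv
```

The first row yields an empty Company field instead of shifting 25 into it.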

6. Do not change the IFS permanently. In the above example, once the IFS is changed, it remains so till the end of the script. This may lead to problems, since the code that follows might have been written with the default behavior of IFS in mind. It is good practice to restore IFS to its old value once the purpose is served.
$ cat test.sh
#!/bin/bash
OLDIFS=$IFS
IFS=","
while read f1 f2 f3
do
        echo "OS is        : $f1"
        echo "Company  is  : $f2"
        echo "Value   is   : $f3"
done < file.csv
IFS=$OLDIFS
OR
#!/bin/bash
while IFS="," read f1 f2 f3
do
        echo "OS is        : $f1"
        echo "Company  is  : $f2"
        echo "Value   is   : $f3"
done < file.csv
   In the first method, the current IFS is saved in the variable OLDIFS, and once the operation is done, IFS is restored to its old value.

   In the second method, the IFS assignment is written directly before the read command, so the new value applies only to that read command. Everything else in the script, including the body of the while loop and the code after it, still sees the old IFS value. You can write some echo statements before and after the while loop printing the IFS value to confirm the same.
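One way to carry out that check is sketched below. It compares the IFS value saved before the loop with the value after it rather than printing the white space characters directly. The variable names before and after, and the inline sample row, are only for this illustration.

```shell
#!/bin/bash
before=$IFS    # remember the IFS value in effect before the loop

while IFS="," read -r f1 f2 f3
do
    echo "$f1 / $f2 / $f3"
done <<'EOF'
Solaris,Sun,25
EOF

after=$IFS     # IFS as seen by the rest of the script

if [ "$before" = "$after" ]; then
    echo "IFS unchanged"
else
    echo "IFS changed"
fi
```

The script reports "IFS unchanged", since the comma setting was limited to the read command.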

7. Read a file with dates using the while loop and extract the date components year, month and date.

Assume the file contents are as shown below. The file contains dates whose components are separated by a slash:
$ cat file
2012/05/21
2012/05/22
The script to parse the dates is the same as the earlier ones, except that the IFS is set to a slash:
$ cat test.sh
#!/bin/bash
OLDIFS=$IFS

IFS="/"
while read f1 f2 f3
do
        echo "Year  is : $f1"
        echo "Month is : $f2"
        echo "Date  is : $f3"

done < file
IFS=$OLDIFS
On running the above script:
$ ./test.sh
Year  is : 2012
Month is : 05
Date  is : 21
Year  is : 2012
Month is : 05
Date  is : 22
      This is now easy to follow. With the IFS set to a slash, the year, month and date components land in f1, f2 and f3 respectively. Inside the while loop, the variables can be used for any purpose.
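As one example of using the components, the sketch below rewrites each date in day-month-year order. The function name reformat_dates, the variable names and the output format are assumptions made for this illustration.

```shell
#!/bin/bash
# reformat_dates is a hypothetical helper name used only for this illustration.
reformat_dates() {
    # Split each line at the slashes, then print the parts in a new order.
    while IFS="/" read -r year month day
    do
        printf '%s-%s-%s\n' "$day" "$month" "$year"
    done
}

printf '2012/05/21\n2012/05/22\n' | reformat_dates
```

For the sample dates this prints 21-05-2012 and 22-05-2012.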

8. Reading a file with multiple delimiters in the shell:

Assuming the sample file contents as shown below. The difference here is the 1st and 2nd fields are separated by colon, whereas the 2nd and 3rd are separated by a slash.
$ cat file
Solaris:Sun/25
Linux:RedHat/30
The script to parse the above file:
$ cat test.sh
#!/bin/bash

OLDIFS=$IFS
IFS=":/"
while read f1 f2 f3
do
        echo "OS is     : $f1"
        echo "Company is: $f2"
        echo "Value is  : $f3"
done < file
IFS=$OLDIFS
On running the script:
$ ./test.sh
OS is     : Solaris
Company is: Sun
Value is  : 25
OS is     : Linux
Company is: RedHat
Value is  : 30
   Check out the IFS setting. IFS can hold more than one character, as shown. When it is set to a colon and a slash, read splits fields whenever it encounters either of the two characters. So the text before the colon goes into the 1st variable, the text between the colon and the slash into the 2nd, and whatever remains goes into the last one.
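Note that each character in IFS is a separator on its own; the value ":/" is not treated as one two-character delimiter. The sketch below, with a made-up function name split_multi and made-up rows, shows the separators in either order and also right next to each other.

```shell
#!/bin/bash
# split_multi is a hypothetical helper name used only for this illustration.
split_multi() {
    while IFS=":/" read -r f1 f2 f3
    do
        printf '[%s] [%s] [%s]\n' "$f1" "$f2" "$f3"
    done
}

# Row 1: colon then slash. Row 2: slash then colon.
# Row 3: colon immediately followed by slash, giving an empty 2nd field.
printf 'Solaris:Sun/25\nLinux/RedHat:30\nAIX:/10\n' | split_multi
```

The first two rows split into three fields regardless of the order of the separators, and the third row produces an empty middle field.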

10 comments:

  1. How to store csv values in an array?

  2. It is very useful article to understand reading data in a shell script. Very good work.

  3. very nice and very helpful article.
    Great..

  4. Really well explained article- thanks

  5. If the last line does not end with a "new line" character, then it will be omitted.

  6. What if the contents of the file are not static, and they differ from program to program which is going to execute the script? I want to concatenate the second column of the file, starting from the second row:
    Email|prog
    Email|first@pp.com
    Email|second@pp.com

    I want the o/p as:--
    first@pp.com;second@pp.com;

    the count of email address may vary. Please help.

    Replies
    1. $ awk -F"|" 'NR!=1{print $2}' file | paste -sd";"
      first@pp.com;second@pp.com
