Thursday, June 28, 2012

awk - 10 examples to split a file into multiple files



 In this article of the awk series, we will see the different scenarios in which we need to split a file into multiple files using awk. The files can be split into multiple files either based on a condition, or based on a pattern or because the file is big and hence needs to split into smaller files.

Sample File1:
Let us consider a sample file with the following contents:
$ cat file1
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600
1. Split the file into 3 different files, one for each item. i.e, All records pertaining to Item1 into a file, records of Item2 into another, etc.
$ awk -F, '{print > $1}' file1
       The files generated by the above command are as  below:
$ cat Item1
Item1,200
Item1,600
$ cat Item3
Item3,900
$ cat Item2
Item2,500
Item2,800
   This looks so simple, right? print prints the entire line, and the line is printed to a file whose name is $1, which is the first field. This means, the first record will get written to a file named 'Item1', and the second record to 'Item2', third to 'Item3',  4th goes to 'Item2', and so on.

2. Split the files by having an extension of .txt to the new file names.
$ awk -F, '{print > $1".txt"}' file1
    The only change here from the above is concatenating the string ".txt" to the $1 which is the first field. As a result, we get the extension to the file names. The files created are below:
$ ls *.txt
Item2.txt  Item1.txt  Item3.txt
3. Split the files by having only the value(the second field) in the individual files, i.e, only 2nd field in the new files without the 1st field:
$ awk -F, '{print $2 > $1".txt"}' file1
     The print command prints the entire record. Since we want only the second field to go to the output files, we do: print $2.
$ cat Item1.txt
200
600
4. Split the files so that all the items whose value is greater than 500 are in the file "500G.txt", and the rest in the  file "500L.txt".
$ awk -F, '{if($2<=500)print > "500L.txt";else print > "500G.txt"}' file1
        The output files created will be as below:
$ cat 500L.txt
Item1,200
Item2,500
$ cat 500G.txt
Item3,900
Item2,800
Item1,600
     Check the second field($2). If it is lesser or equal to 500, the record goes to "500L.txt", else to "500G.txt".
  Other way to achieve the same thing is using the ternary operator in awk:
$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1
      The condition for greater or lesser than 500 is checked and the appropriate file name is assigned to variable x. The record is then written to the file present in the variable x.

Sample File2:
Let us consider another file with a different set of contents. This file has a pattern 'START' at frequent intervals.
$ cat file2
START
Unix
Linux
START
Solaris
Aix
SCO
5. Split the file into multiple files at every occurrence of the pattern START .
$ awk '/START/{x="F"++i;}{print > x;}' file2
     This command contains 2 sets of curly braces: The control goes to the first set of braces only on encountering a line containing the pattern START. The second set will be encountered by every line since there is no condition and hence always true.
   On encountering the pattern START, a new file name is created and stored. When the first START comes, x will contain "F1" and the control goes to the next set of braces and the record is written to F1, and the subsequent records go the file "F1" till the next START comes. On encountering next START, x will contain "F2" and the subsequent lines goes to "F2" till the next START, and it continues.
$ cat F1
START
Unix
Linux
Solaris
$ cat F2
START
Aix
SCO
6. Split the file into multiple files at every occurrence of the pattern START. But the line containing the pattern should not be in the new files.
$ awk '/START/{x="F"++i;next}{print > x;}' file2
       The only difference in this from the above is the inclusion of the next command. Due to the next command, the lines containing the START enters the first curly braces and then starts reading the next line immediately due to the next command. As a result, the START lines does not get to the second curly braces and hence the START does not appear in the split files.
$ cat F1
Unix
Linux
$ cat F2
Solaris
Aix
SCO
7. Split the file by inserting a header record in every new file.
$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print > x;}' file2
       The change here from the earlier one is this: Before the next command, we write the header record into the file. This is the right place to write the header record since this is where the file is created first.
$ cat F1
ANY HEADER
Unix
Linux
$ cat F2
ANY HEADER
Solaris
Aix
SCO
Sample File3:
Let us consider a file with the sample contents:
$ cat file3
Unix
Linux
Solaris
AIX
SCO
8. Split the file into multiple files at every 3rd line . i.e, First 3 lines into F1, next 3 lines into F2 and so on.
$ awk 'NR%3==1{x="F"++i;}{print > x}'  file3
      In other words, this is nothing but splitting the file into equal parts. The condition does the trick here: NR%3==1 : NR is the line number of the current record. NR%3 will be equal to 1 for every 3rd line such as 1st, 4th, 7th and so on. And at every 3rd line, the file name is changed in the variable x, and hence the records are written to the appropriate files.
$ cat F1
Unix
Linux
Solaris
$ cat F2
Aix
SCO 
Sample File4:
Let us update the above file with a header and trailer:
$ cat file4
HEADER
Unix
Linux
Solaris
AIX
SCO
TRAILER
9. Split the file at every 3rd line without the header and trailer in the new files.
sed '1d;$d;' file4 | awk 'NR%3==1{x="F"++i;}{print > x}' 
         The earlier command does the work for us, only thing is to pass to the above command without the header and trailer. sed does it for us.  '1d' is to delete the 1st line, '$d' to delete the last line.
$ cat F1
Unix
Linux
Solaris
$ cat F2
AIX
SCO
10.  Split the file at every 3rd line, retaining the header and trailer in every file.
$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print> a[j];}' file4
     This one is little tricky. Before the file is processed, the first line is read using getline into the variable f. NR%3 is checked with 2 instead of 1 as in the earlier case because since the first line is a header, we need to split the files at 2nd, 5th, 8th lines, and so on. All the file names are stored in the array "a" for later processing.
    Without the END label, all the files will have the header record, but only the last file will have the trailer record. So, the END label is to precisely write the trailer record to all the files other than the last file.
$ cat F1
HEADER
Unix
Linux
Solaris
TRAILER
$ cat F2
HEADER
Aix
SCO
TRAILER

18 comments:

  1. hi
    is there an awk command to split a file in terms of bytes.I had a file of 300 MB and I wanted only the first 100 mb of that file.I used
    split -b 100M filename newfile.
    However it split into 3 files of size 104857600 ,104857600 and 9666150 .
    they were not exactly 100 mb size not even the first one.Is there any option?

    ReplyDelete
    Replies
    1. The file sizes of your 1st two split files are indeed 100MB. split command has done what you are looking for.

      Delete
  2. Hi,
    In the first example , "awk -F, '{print > $1}' file1", can you tell me how we can handle two scenarios 1. handling case sensitivity of the field which we use for splitting . (Eg : Item1 , item1,ITEM1 should go a same file)
    2. Handling NULL values. If NULL values are coming the records are skipped and will not generate a new file for NULL.
    -- Deepesh

    ReplyDelete
  3. Hi,

    I have a file in the below format:
    ::::::::::::::
    AMS-PRIV-10
    ::::::::::::::
    AUTRE-10009608
    AVIDT-10009807
    BBTAC-TEST-VRF
    BLKBD-10005163
    COMME-10009455
    CSION-10003183
    CTESO-10011350
    DCDCC-10005576
    ECHSA-10006326
    FUSEN-10010801
    GEHLC-10007350
    GLABS-10005004
    GLOBE-10001386
    GSADM-10009406
    ::::::::::::::
    ANC-PRIV-10
    ::::::::::::::
    BBTAC-TEST-VRF
    BURCF-10005877
    CLSIN-10007729
    DOJNX-10005331
    DPTVA-10001225
    DPTVA-15001225
    GABRL-10008510
    GLOWP-10006019
    ::::::::::::::

    So here I need to split this file in to two files. The first file name should be the "AMS-PRIV-10" and the content should be the lines present between the "::::::::::::::" and the next file name is "ANC-PRIV-10" and the content should be the lines present till the end of "::::::::::::"

    Awaiting your response.

    Thanks,
    -Venkat

    ReplyDelete
    Replies
    1. awk '/^:/{getline x;getline;next}{print > x}' file

      Please post questions in our forum(Q&A tab above) in future.

      Delete
  4. Hi,

    I have xml files with 100 thousand records. I want to split this xml file and each file should contain only 10,000 rows. Below is sample xml which I want to split it. Also when we split file we will have to add root elements in each file. In this case there are 3 root elements (productextract,extractdate & products) as you see. Any help would be greatly appericiated











    Thanks,
    Nish

    ReplyDelete
  5. Hello,

    I have a problem and I need help.

    Let's consider the following XML file. This file is composed of a header and 4 nodes delimited by and .
    I need to copy each node with a value equals to "05" in a file named "file_05.xml". The other node have to be filtered out.



    ...
    05
    ...


    ...
    01
    ...


    ...
    no tag in this file
    ...


    ...
    05
    ...


    Thanks to the example 7 "Split the file by inserting a header record in every new file" I know how to copy each node in a different file. But unfortunately, I have no idea to solve my problem.
    Maybe using array ?

    Can anybody help me ?

    ReplyDelete
  6. I could use some help with splitting a file using awk. I have a file with records of numbers:
    timestamp x y z qlty
    the file is segmented with lines starting with the "/" followed by a timestamp and a comment:
    / hhmmss.0 comment
    I would ike to split the file at "/" and use the timestamp as part of the new filename.

    ReplyDelete
    Replies
    1. Please post your question in Q&A with input and output examples.

      Delete
  7. I want to split a very very large file on every Nth occurrence of a pattern into smaller files using awk.

    ReplyDelete
  8. I have file like
    1,suma,1000
    2,ravi,2000
    3,suma,3000
    4,suma 4000
    but i want count suma word how many time is there
    how can please help me asap....

    ReplyDelete
  9. Hi Guru,

    I have tried #5, but the shell shows me the following message:

    "awk: estacion_year18 makes too many open files
    input record number 953, file area1.txt
    source line number 1"

    This es a very large .txt file with a common pattern, but the awk function seems to be limited to 18 files. There is a way to enlarge this limit?

    ReplyDelete
  10. awk '/Workstation/{x="F"++i;}{print > x;}' file3
    The above command works in Linux, but not in AIX as I'm getting below error. Please look into this.

    [uedcb017:twsadm /opt/Tivoli/TWA/TWS/scripts/testflat ]awk '/Workstation/{x="F"++i;}{print > x;}' file3
    awk: 0602-576 A print or getline function must have a file name.
    The input line number is 1. The file is file3.
    The source line number is 1.

    ReplyDelete
  11. If you get an error saying "too many open files" you need to close the file. Example:

    BEGIN {FS=","}
    {
    filename=$1".csv"
    print $3","$4 >> filename # append line to file
    close(filename)
    }

    ReplyDelete
  12. Hi.
    I've a file which I want to split into multiple files based on the value in the 12th column. Delimiter of my file is |
    For example, if 12th column is Anthem (or Anthem_new or Anthem_old), they should go into a file with name Anthem_member_datehere.txt (for example Anthem_member_20200513.txt).
    Lines with 12th column being anything but having Aetna in it should go to Aetna_member_datehere.txt. Similarly for BSCA.
    Can you please help me out with this one?

    ReplyDelete
  13. You refer to examples 2 and 3 and show here what is being tried. We can help from there

    ReplyDelete