The UNIX School: awk - 10 examples to split a file into multiple files

Thursday, June 28, 2012

awk - 10 examples to split a file into multiple files

In this article of the awk series, we will see the different scenarios in which we need to split a file into multiple files using awk. The files can be split into multiple files either based on a condition, or based on a pattern or because the file is big and hence needs to split into smaller files.

Sample File1:
Let us consider a sample file with the following contents:

$ cat file1
Item1,200
Item2,500
Item3,900
Item2,800
Item1,600

1. Split the file into 3 different files, one for each item. i.e, All records pertaining to Item1 into a file, records of Item2 into another, etc.

$ awk -F, '{print > $1}' file1

The files generated by the above command are as below:

$ cat Item1
Item1,200
Item1,600

$ cat Item3
Item3,900

$ cat Item2
Item2,500
Item2,800

This looks so simple, right? print prints the entire line, and the line is printed to a file whose name is $1, which is the first field. This means, the first record will get written to a file named 'Item1', and the second record to 'Item2', third to 'Item3', 4th goes to 'Item2', and so on.

2. Split the files by having an extension of .txt to the new file names.

$ awk -F, '{print > $1".txt"}' file1

The only change here from the above is concatenating the string ".txt" to the $1 which is the first field. As a result, we get the extension to the file names. The files created are below:

$ ls *.txt
Item2.txt  Item1.txt  Item3.txt

3. Split the files by having only the value(the second field) in the individual files, i.e, only 2nd field in the new files without the 1st field:

$ awk -F, '{print $2 > $1".txt"}' file1

The print command prints the entire record. Since we want only the second field to go to the output files, we do: print $2.

$ cat Item1.txt
200
600

4. Split the files so that all the items whose value is greater than 500 are in the file "500G.txt", and the rest in the file "500L.txt".

$ awk -F, '{if($2<=500)print > "500L.txt";else print > "500G.txt"}' file1

The output files created will be as below:

$ cat 500L.txt
Item1,200
Item2,500

$ cat 500G.txt
Item3,900
Item2,800
Item1,600

Check the second field($2). If it is lesser or equal to 500, the record goes to "500L.txt", else to "500G.txt".
Other way to achieve the same thing is using the ternary operator in awk:

$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1

The condition for greater or lesser than 500 is checked and the appropriate file name is assigned to variable x. The record is then written to the file present in the variable x.

Sample File2:
Let us consider another file with a different set of contents. This file has a pattern 'START' at frequent intervals.

$ cat file2
START
Unix
Linux
START
Solaris
Aix
SCO

5. Split the file into multiple files at every occurrence of the pattern START .

$ awk '/START/{x="F"++i;}{print > x;}' file2

This command contains 2 sets of curly braces: The control goes to the first set of braces only on encountering a line containing the pattern START. The second set will be encountered by every line since there is no condition and hence always true.

On encountering the pattern START, a new file name is created and stored. When the first START comes, x will contain "F1" and the control goes to the next set of braces and the record is written to F1, and the subsequent records go the file "F1" till the next START comes. On encountering next START, x will contain "F2" and the subsequent lines goes to "F2" till the next START, and it continues.

$ cat F1
START
Unix
Linux
Solaris

$ cat F2
START
Aix
SCO

6. Split the file into multiple files at every occurrence of the pattern START. But the line containing the pattern should not be in the new files.

$ awk '/START/{x="F"++i;next}{print > x;}' file2

The only difference in this from the above is the inclusion of the next command. Due to the next command, the lines containing the START enters the first curly braces and then starts reading the next line immediately due to the next command. As a result, the START lines does not get to the second curly braces and hence the START does not appear in the split files.

$ cat F1
Unix
Linux

$ cat F2
Solaris
Aix
SCO

7. Split the file by inserting a header record in every new file.

$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print > x;}' file2

The change here from the earlier one is this: Before the next command, we write the header record into the file. This is the right place to write the header record since this is where the file is created first.

$ cat F1
ANY HEADER
Unix
Linux

$ cat F2
ANY HEADER
Solaris
Aix
SCO

Sample File3:
Let us consider a file with the sample contents:

$ cat file3
Unix
Linux
Solaris
AIX
SCO

8. Split the file into multiple files at every 3rd line . i.e, First 3 lines into F1, next 3 lines into F2 and so on.

$ awk 'NR%3==1{x="F"++i;}{print > x}'  file3

In other words, this is nothing but splitting the file into equal parts. The condition does the trick here: NR%3==1 : NR is the line number of the current record. NR%3 will be equal to 1 for every 3rd line such as 1st, 4th, 7th and so on. And at every 3rd line, the file name is changed in the variable x, and hence the records are written to the appropriate files.

$ cat F1
Unix
Linux
Solaris

$ cat F2
Aix
SCO

Sample File4:
Let us update the above file with a header and trailer:

$ cat file4
HEADER
Unix
Linux
Solaris
AIX
SCO
TRAILER

9. Split the file at every 3rd line without the header and trailer in the new files.

sed '1d;$d;' file4 | awk 'NR%3==1{x="F"++i;}{print > x}'

The earlier command does the work for us, only thing is to pass to the above command without the header and trailer. sed does it for us. '1d' is to delete the 1st line, '$d' to delete the last line.

$ cat F1
Unix
Linux
Solaris

$ cat F2
AIX
SCO

10. Split the file at every 3rd line, retaining the header and trailer in every file.

$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print> a[j];}' file4

This one is little tricky. Before the file is processed, the first line is read using getline into the variable f. NR%3 is checked with 2 instead of 1 as in the earlier case because since the first line is a header, we need to split the files at 2nd, 5th, 8th lines, and so on. All the file names are stored in the array "a" for later processing.

Without the END label, all the files will have the header record, but only the last file will have the trailer record. So, the END label is to precisely write the trailer record to all the files other than the last file.

$ cat F1
HEADER
Unix
Linux
Solaris
TRAILER

$ cat F2
HEADER
Aix
SCO
TRAILER

18 comments:

AnonymousMarch 1, 2013 at 4:31 AM
hi
is there an awk command to split a file in terms of bytes.I had a file of 300 MB and I wanted only the first 100 mb of that file.I used
split -b 100M filename newfile.
However it split into 3 files of size 104857600 ,104857600 and 9666150 .
they were not exactly 100 mb size not even the first one.Is there any option?
ReplyDelete
Replies
Deepesh PoomadathilJune 29, 2013 at 10:30 PM
Hi,
In the first example , "awk -F, '{print > $1}' file1", can you tell me how we can handle two scenarios 1. handling case sensitivity of the field which we use for splitting . (Eg : Item1 , item1,ITEM1 should go a same file)
2. Handling NULL values. If NULL values are coming the records are skipped and will not generate a new file for NULL.
-- Deepesh
ReplyDelete
Replies
Ratna sekhar ChowdaryAugust 27, 2013 at 12:02 PM
Hi,

I have a file in the below format:
::::::::::::::
AMS-PRIV-10
::::::::::::::
AUTRE-10009608
AVIDT-10009807
BBTAC-TEST-VRF
BLKBD-10005163
COMME-10009455
CSION-10003183
CTESO-10011350
DCDCC-10005576
ECHSA-10006326
FUSEN-10010801
GEHLC-10007350
GLABS-10005004
GLOBE-10001386
GSADM-10009406
::::::::::::::
ANC-PRIV-10
::::::::::::::
BBTAC-TEST-VRF
BURCF-10005877
CLSIN-10007729
DOJNX-10005331
DPTVA-10001225
DPTVA-15001225
GABRL-10008510
GLOWP-10006019
::::::::::::::

So here I need to split this file in to two files. The first file name should be the "AMS-PRIV-10" and the content should be the lines present between the "::::::::::::::" and the next file name is "ANC-PRIV-10" and the content should be the lines present till the end of "::::::::::::"

Awaiting your response.

Thanks,
-Venkat
ReplyDelete
Replies
UnknownNovember 2, 2013 at 12:03 AM
Hi,

I have xml files with 100 thousand records. I want to split this xml file and each file should contain only 10,000 rows. Below is sample xml which I want to split it. Also when we split file we will have to add root elements in each file. In this case there are 3 root elements (productextract,extractdate & products) as you see. Any help would be greatly appericiated

Thanks,
Nish
ReplyDelete
Replies
AnonymousDecember 4, 2013 at 8:24 PM
Hello,

I have a problem and I need help.

Let's consider the following XML file. This file is composed of a header and 4 nodes delimited by and .
I need to copy each node with a value equals to "05" in a file named "file_05.xml". The other node have to be filtered out.

...
05
...

...
01
...

...
no tag in this file
...

...
05
...

Thanks to the example 7 "Split the file by inserting a header record in every new file" I know how to copy each node in a different file. But unfortunately, I have no idea to solve my problem.
Maybe using array ?

Can anybody help me ?
ReplyDelete
Replies
lokiDecember 5, 2013 at 3:13 PM
I could use some help with splitting a file using awk. I have a file with records of numbers:
timestamp x y z qlty
the file is segmented with lines starting with the "/" followed by a timestamp and a comment:
/ hhmmss.0 comment
I would ike to split the file at "/" and use the timestamp as part of the new filename.
ReplyDelete
Replies
sameer.oakFebruary 5, 2014 at 6:05 PM
I want to split a very very large file on every Nth occurrence of a pattern into smaller files using awk.
ReplyDelete
Replies
UnknownMarch 19, 2014 at 11:21 AM
I have file like
1,suma,1000
2,ravi,2000
3,suma,3000
4,suma 4000
but i want count suma word how many time is there
how can please help me asap....
ReplyDelete
Replies
UnknownOctober 2, 2015 at 9:18 AM
Hi Guru,

I have tried #5, but the shell shows me the following message:

"awk: estacion_year18 makes too many open files
input record number 953, file area1.txt
source line number 1"

This es a very large .txt file with a common pattern, but the awk function seems to be limited to 18 files. There is a way to enlarge this limit?
ReplyDelete
Replies
satishJuly 18, 2017 at 8:21 PM
awk '/Workstation/{x="F"++i;}{print > x;}' file3
The above command works in Linux, but not in AIX as I'm getting below error. Please look into this.

[uedcb017:twsadm /opt/Tivoli/TWA/TWS/scripts/testflat ]awk '/Workstation/{x="F"++i;}{print > x;}' file3
awk: 0602-576 A print or getline function must have a file name.
The input line number is 1. The file is file3.
The source line number is 1.
ReplyDelete
Replies
AdamJanuary 9, 2019 at 10:43 PM
If you get an error saying "too many open files" you need to close the file. Example:

BEGIN {FS=","}
{
filename=$1".csv"
print $3","$4 >> filename # append line to file
close(filename)
}
ReplyDelete
Replies
Gaurav JhaMay 13, 2020 at 7:28 PM
Hi.
I've a file which I want to split into multiple files based on the value in the 12th column. Delimiter of my file is |
For example, if 12th column is Anthem (or Anthem_new or Anthem_old), they should go into a file with name Anthem_member_datehere.txt (for example Anthem_member_20200513.txt).
Lines with 12th column being anything but having Aetna in it should go to Aetna_member_datehere.txt. Similarly for BSCA.
Can you please help me out with this one?
ReplyDelete
Replies
Guru PrasadMay 14, 2020 at 11:22 AM
You refer to examples 2 and 3 and show here what is being tried. We can help from there
ReplyDelete
Replies

Add comment

Pages

Thursday, June 28, 2012

awk - 10 examples to split a file into multiple files

18 comments: