The UNIX School: awk - Join or merge lines on finding a pattern

Monday, May 21, 2012

awk - Join or merge lines on finding a pattern

In one of our earlier articles, we discussed joining all the lines in a file and also joining every 2 lines in a file. In this article, we will see how to join lines on encountering a pattern, using awk or gawk.

Let us assume a file with the following contents. Lines containing START mark the beginning of each group. We have to join all the lines that follow each START line.

$ cat file
START
Unix
Linux
START
Solaris
Aix
SCO

1. Join the lines following the pattern START without any delimiter.

$ awk '/START/{if (NR!=1)print "";next}{printf "%s",$0}END{print "";}' file
UnixLinux
SolarisAixSCO

Basically, what we are trying to do is accumulate the lines following a START and finish the joined line on encountering the next START. /START/ matches lines containing the pattern START, and the commands within the first {} run only on those lines. For every START line other than the first (NR!=1), print "" emits a newline, which ends the joined line of the previous group. Without this condition, a blank line would appear at the very beginning of the output, since the file begins with a START.

The next command skips the rest of the program for the START lines, so the second set of braces {} runs only for the lines not containing START. This part prints the line with printf, which, unlike print, does not add a terminating newline character. As a result, all the lines after a START end up on the same line. The END block prints a final newline, without which the shell prompt would appear at the end of the last line of output.
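The same logic, spread over several lines with comments, may be easier to follow. This is only a sketch under a few assumptions of our own: the sample input is fed through a pipe instead of a file, the shell variables input and joined are names chosen for illustration, and printf is given an explicit "%s" format so that a % character in the data is not interpreted as a format specifier.

```shell
# Sample input from this article, held in a shell variable (illustrative name).
input=$(printf '%s\n' START Unix Linux START Solaris Aix SCO)

joined=$(printf '%s\n' "$input" | awk '
/START/ {
    if (NR != 1) print ""   # newline that ends the previous joined line
    next                    # skip the block below for START lines
}
{ printf "%s", $0 }         # append the line without a trailing newline
END { print "" }            # terminate the last joined line
')

printf '%s\n' "$joined"
```

Run against the sample input, this should print UnixLinux and SolarisAixSCO on two lines, the same as the one-liner.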

2. Join the lines following the pattern START with space as delimiter.

$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file
Unix Linux
Solaris Aix SCO

This is the same as the earlier one, except that the printf format string is "%s ": the line followed by a space, which is the delimiter in this case.
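One side effect of the "%s " format is that every output line ends with a trailing space. If that matters, one possible variation is to print the delimiter before each line except the first line of a group. The variable names d and sep, and the shell variables input and joined, are our own choices for this sketch.

```shell
# Sample input from this article, held in a shell variable (illustrative name).
input=$(printf '%s\n' START Unix Linux START Solaris Aix SCO)

joined=$(printf '%s\n' "$input" | awk -v d=" " '
/START/ {
    if (NR != 1) print ""   # end the previous joined line
    sep = ""                # first line of a group gets no delimiter
    next
}
{
    printf "%s%s", sep, $0  # delimiter goes before the line, not after it
    sep = d                 # later lines in the group are prefixed with d
}
END { print "" }
')

printf '%s\n' "$joined"
```

Passing a different value with -v d="..." changes the delimiter without touching the program.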

3. Join the lines following the pattern START with comma as delimiter.

$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
Unix,Linux
Solaris,Aix,SCO

Here, we build the complete line in a variable x and print x whenever a new START line is found. The expression x=(!x)?$0:x","$0 uses the ternary operator, as in C or Perl. It means: if x is empty, assign the current line ($0) to x, else append a comma and the current line to x. As a result, x holds the lines following a START joined with commas. In the END block, x is printed once more, since no START line follows the last group to trigger its printing.
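The tests (x) and (!x) treat x as a truth value, so a data line consisting only of 0, or an empty line, makes x look empty and the line gets lost. A minimal sketch of one way around this, assuming we count the lines of the current group in a variable n (a name of our own) and test the counter instead of x:

```shell
# Sample input with a line that is just "0" (made up for this illustration).
input=$(printf '%s\n' START 0 Unix START Solaris Aix SCO)

joined=$(printf '%s\n' "$input" | awk -v d="," '
/START/ {
    if (n) print x          # print the previous group if it had any lines
    x = ""; n = 0           # start a new group
    next
}
{ x = (n++ ? x d $0 : $0) } # first line starts x, later lines are appended
END { if (n) print x }      # print the last group
')

printf '%s\n' "$joined"
```

With this input, the first group should come out as 0,Unix and the second as Solaris,Aix,SCO.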

4. Join the lines following the pattern START with comma as delimiter, including the pattern matching line.

$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux
START,Solaris,Aix,SCO

The only difference here is the missing next statement. Without next, the commands in the second set of curly braces run for the START line as well, and hence it gets concatenated too.

5. Join the lines following the pattern START with comma as delimiter and print the pattern matching line as well. However, the pattern line should not be joined.

$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
START
Unix,Linux
START
Solaris,Aix,SCO

In this case, instead of adding START to the variable x, the START line is printed right away. As a result, the START line comes out separately, and the remaining lines get joined.
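Note that /START/ matches any line that merely contains START, such as RESTART. If only lines that are exactly START should act as group markers, one option is to compare the whole line instead of using a regular expression. The following is a sketch along those lines, based on the last command; the extra RESTART line in the input is made up for the illustration.

```shell
# Sample input where one data line merely contains the word START.
input=$(printf '%s\n' START Unix RESTART Linux START Solaris)

joined=$(printf '%s\n' "$input" | awk '
$0 == "START" {                       # exact match, not a substring match
    if (x != "") print x              # print the previous joined group
    print                             # print the START line on its own
    x = ""
    next
}
{ x = (x == "" ? $0 : x "," $0) }     # join the remaining lines with commas
END { if (x != "") print x }
')

printf '%s\n' "$joined"
```

Here RESTART is treated as ordinary data, so the first group should be printed as Unix,RESTART,Linux.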

8 comments:

Unknown, August 8, 2013 at 3:16 AM
I'm a mainframe and SAS programmer trying to learn UNIX Shell Scripts:
I cannot get following command to work properly (it does not concat the records after the START delim...rather it shows last rcrd before the START and it blends last rcrd with leftover from 1st rcrd)...please help; thx:

awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file

Al Diovanni
adiovanni@earthlink.net
C#: 347.525.2501
H#: 718.987.8672

Unknown, April 9, 2015 at 5:00 PM
Hi Experts, I am trying to achieve below results. Please help:

For Inputs:
START 1
UNIX
Linux
START 2
Solaris
Aix
SCO

Output should be:
START 1~UNIX
START 1~Linux
START 2~Solaris
START 2~Aix
START 3~SCO

Unknown, April 10, 2015 at 5:51 PM
Thanks a lot Guru!
This command is working fine in all situations except when input is like this:
START 1
START 2
Unix
Linux

in this case, output is :
START 2~Unix
START 2~Linux

but expected output is :
START 1
START 2~Unix
START 3~Linux

Unknown, July 30, 2015 at 5:00 PM
hi so i have a problem that is kind of similar to this i have this
>@1M1U7:00212:00595
_F_48_30.5625
CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT
>@1M1U7:00241:00593
_F_48_30.3958333333
CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT

and i want to get to this:
>@1M1U7:00212:00595_F_48_30.5625
CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT
>@1M1U7:00241:00593_F_48_30.3958333333
CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT

Guru Prasad, July 30, 2015 at 11:12 PM
awk '/^>/{a=$0;getline x;$0=a x;}1' file