Monday, May 21, 2012

awk - Join or merge lines on finding a pattern



 In one of our earlier articles, we discussed joining all lines in a file and also joining every 2 lines in a file. In this article, we will see how to join lines on encountering a pattern, using awk or gawk.

Let us assume a file with the following contents. Some of the lines contain the pattern START. We have to join the lines that follow each START line, up to the next START line.
$ cat file
START
Unix
Linux
START
Solaris
Aix
SCO
1. Join the lines following the pattern START without any delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf "%s",$0}END{print "";}' file
UnixLinux
SolarisAixSCO
    The idea is to print the lines following a START on a single output line, and to end that output line on encountering the next START. /START/ matches the lines containing the pattern START, and the commands within the first {} run only for those lines. There, a newline is printed unless the line is the first line of the file (NR!=1). Without this condition, the output would begin with a blank line, since the file starts with a START line.

   The next command skips the rest of the script for the START lines, so the second {} block runs only for the lines not containing START. This block prints the line without a terminating newline character (printf). As a result, all the lines after a START appear on the same output line. The END block prints a final newline, without which the shell prompt would appear at the end of the last line of output.
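The NR!=1 check assumes that the file begins with a START line. As a minimal sketch of a variant that does not depend on this, we can count the lines printed in the current group and end the output line only when that count is non-zero. The function name join_groups and the counter n are our own illustrative names, not part of the command above:

```shell
# Sketch: join the lines after each START, without relying on NR.
# join_groups and n are hypothetical names chosen for this example.
join_groups() {
  awk '
    /START/ { if (n) print ""; n = 0; next }  # end the previous group only if it printed something
    { printf "%s", $0; n++ }                  # print the line without a newline
    END { if (n) print "" }                   # terminate the last group
  ' "$@"
}

printf 'START\nUnix\nLinux\nSTART\nSolaris\nAix\nSCO\n' | join_groups
# UnixLinux
# SolarisAixSCO
```

With this version, two START lines in a row or a file that does not begin with START should not produce stray blank lines.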

2. Join the lines following the pattern START with space as delimiter.
$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file
Unix Linux
Solaris Aix SCO
    This is the same as the earlier one, except that printf uses the format "%s " to print a space, the delimiter in this case, after every line. Note that every output line therefore ends with a trailing space.
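Because the "%s " format prints the space after each line, the joined lines end with a trailing space. One possible way to avoid it, shown here only as a sketch, is to print the separator before every line except the first of a group. The function name join_with_space and the variable sep are our own names for this example:

```shell
# Sketch: join with a space, but without the trailing space.
# join_with_space and sep are hypothetical names chosen for this example.
join_with_space() {
  awk '
    /START/ { if (sep != "") print ""; sep = ""; next }  # close the previous group, reset separator
    { printf "%s%s", sep, $0; sep = " " }                # separator goes before all but the first line
    END { if (sep != "") print "" }
  ' "$@"
}

printf 'START\nUnix\nLinux\nSTART\nSolaris\nAix\nSCO\n' | join_with_space
# Unix Linux
# Solaris Aix SCO
```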

3. Join the lines following the pattern START with comma as delimiter.
$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
Unix,Linux
Solaris,Aix,SCO
    Here, we build the complete line in a variable x and print x whenever a new START line is found. The expression x=(!x)?$0:x","$0 uses the ternary operator, as in C or Perl. It means: if x is empty, assign the current line ($0) to x; otherwise append a comma and the current line to x. As a result, x contains the lines following a START pattern, joined with commas. In the END block, x is printed once more, since no START line follows the last group to trigger its printing.
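The !x test treats x as empty when it holds a value that awk considers false, which includes a line consisting of just 0. The sketch below counts the lines in the group instead, and takes the delimiter as an argument so that the same command works for commas, spaces or anything else. The function name join_with_delim and the variables d and n are assumptions made for this example:

```shell
# Sketch: join the lines after each START with an arbitrary delimiter.
# join_with_delim, d and n are hypothetical names chosen for this example.
join_with_delim() {
  delim=$1; shift                       # first argument: delimiter; the rest: input files
  awk -v d="$delim" '
    /START/ { if (n) print x; x = ""; n = 0; next }  # print the finished group, start a new one
    { x = (n++ ? x d : "") $0 }                      # prefix with "x d" except for the first line
    END { if (n) print x }
  ' "$@"
}

printf 'START\nUnix\nLinux\nSTART\nSolaris\nAix\nSCO\n' | join_with_delim ','
# Unix,Linux
# Solaris,Aix,SCO
```

Note that awk interprets backslash escapes in values passed with -v, so a delimiter containing a backslash would need extra care.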

4. Join the lines following the pattern START with comma as delimiter, including the pattern-matching line.
$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux
START,Solaris,Aix,SCO
     The difference here is the missing next statement. Because next is not there, the commands present in the second set of curly braces are applicable for the START line as well, and hence it also gets concatenated.

5. Join the lines following the pattern START with comma as delimiter, and print the pattern-matching line as well, but on a line of its own.
$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file
START
Unix,Linux
START
Solaris,Aix,SCO
    Here, instead of adding the START line to the variable x, we print it right away. As a result, the START line comes out separately, and the remaining lines get joined.
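All the commands above hard-code the pattern START. As a rough sketch, the pattern can also be passed in as a variable and matched with the ~ operator, which makes the command reusable for other markers. The function name join_keep_header and the variables pat and n are our own names, and the group is tracked with a counter rather than with the !x test:

```shell
# Sketch: print each header line on its own line and join the lines that follow with commas.
# join_keep_header, pat and n are hypothetical names chosen for this example.
join_keep_header() {
  pat=$1; shift                         # first argument: regex for the header line
  awk -v pat="$pat" '
    $0 ~ pat { if (n) print x; print; x = ""; n = 0; next }  # flush the group, then print the header
    { x = (n++ ? x "," : "") $0 }                            # accumulate the joined line
    END { if (n) print x }
  ' "$@"
}

printf 'START\nUnix\nLinux\nSTART\nSolaris\nAix\nSCO\n' | join_keep_header '^START'
# START
# Unix,Linux
# START
# Solaris,Aix,SCO
```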

8 comments:

  1. I'm a mainframe and SAS programmer trying to learn UNIX Shell Scripts:
    I cannot get following command to work properly (it does not concat the records after the START delim...rather it shows last rcrd before the START and it blends last rcrd with leftover from 1st rcrd)...please help; thx:

    awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file

    Al Diovanni

    1. Looks like your file contains ^M characters. Run the dos2unix command on your file before running the awk command.

    2. Thanks for your help: i got it to work. I could not install dos2unix (using yum) because i don't yet know how to get access to my fedora linux root directory; however, someone gave me this dos2unix equivalent command which fixed my dos text input file:
      tr -d '\r' < awk_merge_join_input_file > awk_merge_join_input_file_new

      So now when I do:

      awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file

      I get the records sandwiched between the START pattern delimiters properly concatenated onto one line.

      Thanks !

  2. Hi Experts, I am trying to achieve below results. Please help:

    For Inputs:
    START 1
    UNIX
    Linux
    START 2
    Solaris
    Aix
    SCO

    Output should be:
    START 1~UNIX
    START 1~Linux
    START 2~Solaris
    START 2~Aix
    START 3~SCO

  3. Thanks a lot Guru!
    This command is working fine in all situations except when input is like this:
    START 1
    START 2
    Unix
    Linux

    in this case, output is :
    START 2~Unix
    START 2~Linux

    but expected output is :
    START 1
    START 2~Unix
    START 3~Linux

  4. hi so i have a problem that is kind of similar to this i have this
    >@1M1U7:00212:00595
    _F_48_30.5625
    CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT
    >@1M1U7:00241:00593
    _F_48_30.3958333333
    CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT

    and i want to get to this:
    >@1M1U7:00212:00595_F_48_30.5625
    CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT
    >@1M1U7:00241:00593_F_48_30.3958333333
    CAATGGGAAATCTTAGGCACTTCTTCCGGCGAATTTCGCGCCATTTCT

  5. awk '/^>/{a=$0;getline x;$0=a x;}1' file
