Tuesday, January 24, 2012

join command




   join command is used to join  two files based on a key field present in both the files. The input files can be separated by white space or any delimiter. join can be looked at like the join functionality in any RDBMS(database). However, the only difference being in the database join we get only the columns which we give in the select clause, in case of unix join, you get all the columns of both the files. And yes, the join can be done using only one key field, not multiple.

1. Let us consider 2 sample files a1 and a2. a1 containing bank account numbers and the balance in the account. a2 containing bank account numbers and the account names.


$ cat a1
251308  30000
251311  25000
251315  10000
251321  12255
$ cat a2
251308  Preeti
251311  Joseph
251315  Abishek
251321  Ankita
 To join these 2 files based on the first column(account number), the following is the command:
$ join a1 a2
251308 30000 Preeti
251311 25000 Joseph
251315 10000 Abishek
251321 12255 Ankita

     Nice result, isnt it? By the way, we never specified the key or the columns on which we want the join to work upon? We still got the expected result becuase join command, by default, takes the first column as the key to join.

   The output, if you notice, contains the key, followed by all the matching columns from the first file a1, followed by all the columns of the second file, a2.

Note: When using join command, both the input files should be sorted on the KEY on which we are going to join the files.

2. Most of the times, files contain some delimiter to separate the columns. Let us update the files with comma delimiter.
$ cat a1
251308, 30000
251311, 25000
251315, 10000
251321, 12255
251325, 45000
$ cat a2
251308, Preeti
251311, Joseph
251315, Abishek
251321, Ankita
251326, Babuta
$ join -t, a1 a2
251308, 30000, Preeti
251311, 25000, Joseph
251315, 10000, Abishek
251321, 12255, Ankita
   Like the sort command, "-t" is the option we use to specify the delimiter. Since comma is the delimiter, we specify it along with -t.

Note: In case of files having delimiter, to use the "join" command, both the files  should contain the same delimiter.

3. What if the key on which the join is to happen is not present in the first column?

Let us swap the columns of the file a1.
$ cat a1
251308, 30000
251311, 25000
251315, 10000
251321, 12255
The join command for this one would be:
$ join -t"," -1 2 -2 1 a1 a2
251308, 30000, Preeti
251311, 25000, Joseph
251315, 10000, Abishek
251321, 12255, Ankita
   Look at the options  "-1 2" and "-2 1".
      -1 2 : use the second column(2) as the key for join in the first file(-1)
      -2 1 : use the first column(1) as the key for join in the second file(-2)

  So, whenever we have files in which the key is not the first column, this is how the join needs to be done.

Note: Keep in mind, in the above example, we need not have specified the key for the second file since the key is the first column itself.  The below example will clarify this:
$ join -t"," -1 2  a1 a2
251308, 30000, Preeti
251311, 25000, Joseph
251315, 10000, Abishek
251321, 12255, Ankita
4. Sometimes, we can have files where a file can have a record for which there is no matching record in the other file.

Let us see this case with an example.
$ cat a1
 30000,251308
 25000,251311
 10000,251315
 12255,251321
 40000,251325
$ cat a2
251308, Preeti
251311, Joseph
251315, Abishek
251321, Ankita
251326, Suvi
   The last record in both the files does not have a matching record in the other files.

Now, lets try to execute the same join command as earlier:
$ join -t"," -1 2  a1 a2
251308, 30000, Preeti
251311, 25000, Joseph
251315, 10000, Abishek
251321, 12255, Ankita
  Same output!!  Join command, by default, displays only those records for which the matches are found. The records without match are discarded.

  join command can also be simulated to do kind of outer join which we do in Oracle. In outer joins, the records for which there is no matching record, the missing fields will be displayed blank.
$ join -t"," -1 2 -a 1 -a2  a1 a2
251308, 30000, Preeti
251311, 25000, Joseph
251315, 10000, Abishek
251321, 12255, Ankita
251325, 40000
251326, Suvi
  -a option is to display the unmatchable or unpairable lines as well.

  This output, if you notice, is not very user friendly, especially the last record. The name came in place of the balance field since there is no balance record present in file a1.

To make the output more readable:
$ join -t"," -1 2 -a 1 -a2  -e ' NULL' -o '0,1.1,2.2' a1 a2
251308, 30000, Preeti
251311, 25000, Joseph
251315, 10000, Abishek
251321, 12255, Ankita
251325, 40000,NULL
251326,NULL, Suvi
  Does not this look like a proper database output with all the missing fields replaced with NULL!!! Let us understand what these options are for:
    -e : tells what to substitute in case of missing fields, NULL in this case. does not work without -o.
    -o : tells for which all missing fields this should be substituted.
         '0,1.1,2.2' - 0 means the key field
                      1.1 - the first field of first file
                      2.2 - the seconf field of the second file

5. join gives you one more option with which we can only see the records for which there is no match:
$ join -t, -1 2 -v1 a1 a2
251325, 40000
    -v1 indicates to display the non-matching records of file1.

Similarly, for file 2:
$ join -t, -1 2 -v2 a1 a2
251326, Suvi
Happy joining!!!

No comments:

Post a Comment