join command is used to join two files based on a key field present in both the files. The input files can be separated by white space or any delimiter. join can be looked at like the join functionality in any RDBMS(database). However, the only difference being in the database join we get only the columns which we give in the select clause, in case of unix join, you get all the columns of both the files. And yes, the join can be done using only one key field, not multiple.
$ cat a1 251308 30000 251311 25000 251315 10000 251321 12255
$ cat a2 251308 Preeti 251311 Joseph 251315 Abishek 251321 AnkitaTo join these 2 files based on the first column(account number), the following is the command:
$ join a1 a2 251308 30000 Preeti 251311 25000 Joseph 251315 10000 Abishek 251321 12255 Ankita
Nice result, isnt it? By the way, we never specified the key or the columns on which we want the join to work upon? We still got the expected result becuase join command, by default, takes the first column as the key to join.
The output, if you notice, contains the key, followed by all the matching columns from the first file a1, followed by all the columns of the second file, a2.
Note: When using join command, both the input files should be sorted on the KEY on which we are going to join the files.
2. Most of the times, files contain some delimiter to separate the columns. Let us update the files with comma delimiter.
$ cat a1 251308, 30000 251311, 25000 251315, 10000 251321, 12255 251325, 45000
$ cat a2 251308, Preeti 251311, Joseph 251315, Abishek 251321, Ankita 251326, Babuta
$ join -t, a1 a2 251308, 30000, Preeti 251311, 25000, Joseph 251315, 10000, Abishek 251321, 12255, AnkitaLike the sort command, "-t" is the option we use to specify the delimiter. Since comma is the delimiter, we specify it along with -t.
Note: In case of files having delimiter, to use the "join" command, both the files should contain the same delimiter.
3. What if the key on which the join is to happen is not present in the first column?
Let us swap the columns of the file a1.
$ cat a1 251308, 30000 251311, 25000 251315, 10000 251321, 12255The join command for this one would be:
$ join -t"," -1 2 -2 1 a1 a2 251308, 30000, Preeti 251311, 25000, Joseph 251315, 10000, Abishek 251321, 12255, AnkitaLook at the options "-1 2" and "-2 1".
-1 2 : use the second column(2) as the key for join in the first file(-1)
-2 1 : use the first column(1) as the key for join in the second file(-2)
So, whenever we have files in which the key is not the first column, this is how the join needs to be done.
Note: Keep in mind, in the above example, we need not have specified the key for the second file since the key is the first column itself. The below example will clarify this:
$ join -t"," -1 2 a1 a2 251308, 30000, Preeti 251311, 25000, Joseph 251315, 10000, Abishek 251321, 12255, Ankita4. Sometimes, we can have files where a file can have a record for which there is no matching record in the other file.
Let us see this case with an example.
$ cat a1 30000,251308 25000,251311 10000,251315 12255,251321 40000,251325
$ cat a2 251308, Preeti 251311, Joseph 251315, Abishek 251321, Ankita 251326, SuviThe last record in both the files does not have a matching record in the other files.
Now, lets try to execute the same join command as earlier:
$ join -t"," -1 2 a1 a2 251308, 30000, Preeti 251311, 25000, Joseph 251315, 10000, Abishek 251321, 12255, AnkitaSame output!! Join command, by default, displays only those records for which the matches are found. The records without match are discarded.
join command can also be simulated to do kind of outer join which we do in Oracle. In outer joins, the records for which there is no matching record, the missing fields will be displayed blank.
$ join -t"," -1 2 -a 1 -a2 a1 a2 251308, 30000, Preeti 251311, 25000, Joseph 251315, 10000, Abishek 251321, 12255, Ankita 251325, 40000 251326, Suvi-a option is to display the unmatchable or unpairable lines as well.
This output, if you notice, is not very user friendly, especially the last record. The name came in place of the balance field since there is no balance record present in file a1.
To make the output more readable:
$ join -t"," -1 2 -a 1 -a2 -e ' NULL' -o '0,1.1,2.2' a1 a2 251308, 30000, Preeti 251311, 25000, Joseph 251315, 10000, Abishek 251321, 12255, Ankita 251325, 40000,NULL 251326,NULL, SuviDoes not this look like a proper database output with all the missing fields replaced with NULL!!! Let us understand what these options are for:
-e : tells what to substitute in case of missing fields, NULL in this case. does not work without -o.
-o : tells for which all missing fields this should be substituted.
'0,1.1,2.2' - 0 means the key field
1.1 - the first field of first file
2.2 - the seconf field of the second file
5. join gives you one more option with which we can only see the records for which there is no match:
$ join -t, -1 2 -v1 a1 a2 251325, 40000-v1 indicates to display the non-matching records of file1.
Similarly, for file 2:
$ join -t, -1 2 -v2 a1 a2 251326, SuviHappy joining!!!
No comments:
Post a Comment