Thursday, September 27, 2012

How to remove duplicate records from a file in Linux?



How to remove / delete duplicate records / lines from a file?

  Let us consider a file with the following content. The duplicate record is 'Linux', which appears twice:
$ cat file
Unix
Linux
Solaris
AIX
Linux
  1. Using sort and uniq:
$ sort file | uniq
AIX
Linux
Solaris
Unix 
  The uniq command retains only the unique records from a file; in other words, it removes duplicates. However, uniq needs a sorted file as input because it compares only adjacent lines.
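
For example, running uniq directly on the unsorted file leaves both 'Linux' entries in place, since they are not adjacent:
$ uniq file
Unix
Linux
Solaris
AIX
Linux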

2. Using only the sort command, without uniq:
$ sort -u file
AIX
Linux
Solaris
Unix
   sort with the -u option removes all the duplicate records, so uniq is not needed at all.
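
As a quick usage note, the unique records can also be counted by piping the result to wc; the file above has 4 unique lines:
$ sort -u file | wc -l
4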

Without changing the order of contents:
   The above 2 methods change the order of the file: the unique records may not appear in the same order as in the original file. The next 2 methods print the file without duplicates while preserving the original order of the lines.

3. Using awk:
$ awk '!a[$0]++' file
Unix
Linux
Solaris
AIX
    This one-liner is a little tricky. awk maintains an associative array, a, indexed by the entire line ($0). The first time a line is seen, a[$0] is 0; since ++ is a post-increment, the expression evaluates to 0 before the count is raised, and the negation of 0 is true, so awk performs its default action and prints the line. When the same line appears again, a[$0] is already non-zero, its negation is false, and the line is not printed.
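
The same logic can be written in an expanded form that spells out the array check and the increment separately; this is just a sketch equivalent to the one-liner above:
$ awk '{ if (a[$0] == 0) print; a[$0]++ }' file
Unix
Linux
Solaris
AIX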

4. Perl solution:
$ perl -lne '$x=$_;if(!grep(/^$x$/,@arr)){print; push @arr,$_ ;}' file
Unix
Linux
Solaris
AIX
   Before a line is printed, it is checked against the array "arr". If it is not already present, the line is printed and also pushed into "arr" so that it does not get printed again when it appears later.
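
A shorter and faster variant, a common Perl idiom rather than part of the original solution, tracks the lines already seen in a hash instead of grepping an array on every iteration:
$ perl -lne 'print unless $seen{$_}++' file
Unix
Linux
Solaris
AIX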

5. A shell script to remove duplicates:
$ cat dupl.sh
#!/bin/bash

# Temporary file named with the current timestamp.
TEMP="temp"`date '+%d%m%Y%H%M%S'`
touch "$TEMP"

# Read the input file line by line; append a line to the temporary file
# only if it is not already there. grep -x matches whole lines and -F
# treats the line as a literal string rather than a regular expression.
while IFS= read -r line
do
    grep -qxF -- "$line" "$TEMP" || printf '%s\n' "$line" >> "$TEMP"
done < "$1"

# Show the de-duplicated contents and remove the temporary file.
cat "$TEMP"
\rm "$TEMP"
   The input file is read line by line in the while loop. Within the loop, each line is appended to a temporary file only if it is not already present there (the whole-line, literal match prevents one record from being mistaken for a substring or a regular expression). The temporary file therefore ends up as a copy of the original file without the duplicates.

On running the above script:
$ ./dupl.sh file
Unix
Linux
Solaris
AIX
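
As a quick sanity check, the script's output can be compared with the awk one-liner from method 3; diff printing nothing means the two results match:
$ diff <(./dupl.sh file) <(awk '!a[$0]++' file)
$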
