How to remove / delete duplicate records / lines from a file?
Let us consider a file with the following content. The duplicate record is 'Linux', with 2 entries:
$ cat file
Unix
Linux
Solaris
AIX
Linux

1. Using sort and uniq:
$ sort file | uniq
AIX
Linux
Solaris
Unix

The uniq command collapses runs of identical adjacent lines into a single line, so it removes duplicates only when they sit next to each other. That is why the file is sorted first: sorting brings identical records together.
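A quick way to see why the sort is needed is to feed uniq input whose duplicates are not adjacent. The sketch below uses throwaway input from printf rather than the sample file, and the variable names are only for illustration.

```shell
# uniq alone: the two "Linux" lines are separated by "Unix", so both survive
unsorted_result=$(printf 'Linux\nUnix\nLinux\n' | uniq)
echo "$unsorted_result"

# sort first: the duplicates become adjacent and uniq collapses them
sorted_result=$(printf 'Linux\nUnix\nLinux\n' | sort | uniq)
echo "$sorted_result"
```

The first command prints all three lines unchanged; only the second one drops the repeated 'Linux'.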
2. Using sort alone, without uniq:
$ sort -u file
AIX
Linux
Solaris
Unix

With the -u option, sort itself outputs only one copy of each duplicate record, and hence uniq is not needed at all.
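If the goal is to update the file itself rather than print the result, sort can write back to its own input with the -o option. A minimal sketch, working on a scratch copy created with mktemp (the variable names are only for illustration):

```shell
# Work on a scratch copy so the original sample file is untouched
scratch=$(mktemp)
printf 'Unix\nLinux\nSolaris\nAIX\nLinux\n' > "$scratch"

# -o names the output file; sort reads all of its input before writing,
# so the same file can be used for input and output
sort -u -o "$scratch" "$scratch"

deduped=$(cat "$scratch")
echo "$deduped"
rm -f "$scratch"
```

Redirecting with "sort -u file > file" is not a substitute: the shell truncates the file before sort gets to read it.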
Without changing the order of contents:
The above 2 methods change the order of the file: the unique records come out sorted, not in the order in which they appear in the file. The methods below print the file without duplicates, keeping the records in the same order in which they were present in the file.
3. Using awk:
$ awk '!a[$0]++' file
Unix
Linux
Solaris
AIX

This one is tricky. awk uses an associative array, a, indexed by the whole line ($0), to remember the lines it has already seen. When a line appears for the 1st time, a[$0] is unset and evaluates to 0. Since ++ is a post-fix operator, the expression a[$0]++ returns that old value 0 and only then increments the stored count to 1. The negation of 0 is 'True', and a true pattern with no action makes awk print the line. When the same line appears again, the stored count is already 1 or more, its negation is 'False', and hence the line does not get printed.
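The one-liner can be easier to follow when spelled out. The sketch below is an equivalent long form; the function name dedupe_keep_order and the array name seen are just illustrative stand-ins, not part of the original command.

```shell
dedupe_keep_order() {
    awk '{
        if (seen[$0] == 0) {   # count is still 0: first time this exact line appears
            print
        }
        seen[$0]++             # remember the line for later occurrences
    }' "$@"
}

result=$(printf 'Unix\nLinux\nSolaris\nAIX\nLinux\n' | dedupe_keep_order)
echo "$result"
```

Called with a file name, as in "dedupe_keep_order file", it reads that file instead of standard input.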
4. Perl solution:
$ perl -lne '$x=$_;if(!grep(/^\Q$x\E$/,@arr)){print; push @arr,$_ ;}' file
Unix
Linux
Solaris
AIX

Every time before printing a line, the line is looked up in the array "arr". If not present, the line is printed and also pushed into the array "arr" so that it does not get printed the next time. The \Q...\E around $x quotes any regex metacharacters in the line, so that the comparison is literal; without it, a line such as 'a.c' would also match 'abc'.
5. A shell script to remove duplicates:
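$ cat dupl.sh
#!/bin/bash
# Print the file given as $1 with duplicate lines removed, keeping the original order.
TEMP="temp$(date '+%d%m%Y%H%M%S')"
touch "$TEMP"
while IFS= read -r line
do
    # -x: match the whole line, -F: treat the line as a fixed string, not a regex
    grep -qxF -e "$line" "$TEMP" || printf '%s\n' "$line" >> "$TEMP"
done < "$1"
cat "$TEMP"
\rm "$TEMP"

The input file is read line by line using the while loop. Within the loop, a line is appended to a temporary file only if that exact line is not already present in it. And hence, the temporary file ends up containing a copy of the original file without duplicates, which is then printed and removed.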
On running the above script:
$ ./dupl.sh file
Unix
Linux
Solaris
AIX
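One detail worth checking in a script like this is how grep decides that a line is already present in the temporary file. By default grep treats its argument as a regular expression and matches it anywhere in a line, which can misreport a new line as a duplicate. The sketch below contrasts that with whole-line, fixed-string matching; the scratch file and its content are made up for the demonstration.

```shell
# Scratch file standing in for the temporary file of lines seen so far
seen_file=$(mktemp)
printf 'Unix System V\n' > "$seen_file"

# Default grep: "Unix" matches inside the longer line,
# so a new line "Unix" would wrongly be treated as a duplicate
if grep -q "Unix" "$seen_file"; then substring_match=yes; else substring_match=no; fi

# -F: fixed string, -x: must match the whole line, -e: the pattern follows
if grep -qxF -e "Unix" "$seen_file"; then exact_match=yes; else exact_match=no; fi

echo "substring match: $substring_match, exact match: $exact_match"
rm -f "$seen_file"
```

With the sample file from this article the difference does not show, since no record is contained in another, but it matters for real data.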