How to find duplicate records / lines in a file in Linux?
Let us consider a file with the following contents. The duplicate record here is 'Linux'.
$ cat file
Unix
Linux
Solaris
AIX
Linux
Let us now see the different ways to find the duplicate record.
1. Using sort and uniq:
$ sort file | uniq -d
Linux
The uniq command has an option "-d" which lists only the duplicate records. sort is needed because uniq detects duplicates only on adjacent lines, i.e. on sorted input. Without "-d", uniq does the opposite: it removes the repeats and prints each record once.
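If you also want to know how many times each duplicate occurs, uniq's "-c" option prefixes every record with its count, and a small awk filter can keep only the counts above 1. This is a variation on method 1, not one of the original commands:

```shell
# Recreate the sample file used throughout this article.
printf 'Unix\nLinux\nSolaris\nAIX\nLinux\n' > file

# "uniq -c" prefixes each distinct record with its occurrence count;
# awk then keeps only the records whose count exceeds 1.
sort file | uniq -c | awk '$1 > 1 {print $2 " appears " $1 " times"}'
# → Linux appears 2 times
```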
2. awk way of fetching duplicate lines:
$ awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file
Linux
Using awk's associative array, every record is stored as an index, and the value held against it is the number of times the record appears in the file. At the end, only those records whose count is more than 1, i.e. the duplicates, are printed.
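Since the associative array already stores the occurrence count against each record, it costs nothing to print the count alongside the duplicate. A small variation on method 2:

```shell
# Recreate the sample file from the article.
printf 'Unix\nLinux\nSolaris\nAIX\nLinux\n' > file

# The value stored against each record is its occurrence count,
# so it can be printed next to the duplicate record itself.
awk '{a[$0]++} END{for (i in a) if (a[i] > 1) print i, a[i]}' file
# → Linux 2
```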
3. Using perl:
$ perl -ne '$h{$_}++;END{foreach (keys%h){print $_ if $h{$_} > 1;}}' file
Linux
This method is almost the same as the awk one above, the only difference being that a hash is used in place of the associative array.
4. Another perl way:
$ perl -ne '$x=$_; if(grep(/^$x$/,@arr)){print;} else {push @arr,$x;}' file
Linux
Every time a record is read, it is searched for in the array "arr". If present, the record is printed since it is a duplicate; otherwise the array is updated with the record.
5. A shell script to fetch / find duplicate records:
#!/bin/bash

TEMP="temp"`date '+%d%m%Y%H%M%S'`
TEMP1="temp1"`date '+%d%m%Y%H%M%S'`
touch $TEMP $TEMP1
while read line
do
  # "-x" matches whole lines, "-F" literally, so "ab" does not
  # falsely match a record "abc".
  if grep -qxF "$line" $TEMP
  then
    echo "$line" >> $TEMP1
  else
    echo "$line" >> $TEMP
  fi
done < $1
cat $TEMP1
\rm $TEMP $TEMP1
Not the most efficient of solutions since it runs grep once for every line of input. The file is read in a while loop, and two temporary files are maintained: TEMP holds the records seen so far, TEMP1 the duplicates. As each line is read, it is looked up in TEMP; if present, it is appended to TEMP1, otherwise to TEMP. At the end, TEMP1 contains only the duplicate records.
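For comparison, here is a sketch of the same loop using a bash associative array instead of temporary files and repeated greps. This is my rewrite, not the original script, and it assumes bash 4 or later for "declare -A":

```shell
#!/bin/bash
# Recreate the sample file from the article.
printf 'Unix\nLinux\nSolaris\nAIX\nLinux\n' > file

# "seen" records every line encountered so far; a record is printed
# the moment it is read a second time. No temp files are needed.
declare -A seen
while IFS= read -r line
do
  [[ -n ${seen[$line]} ]] && echo "$line"
  seen[$line]=1
done < file
# → Linux
```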
6. A shorter awk way:
$ awk 'a[$0]++' file
Linux
Since "a[$0]++" post-increments, the expression is false (0) for the first occurrence of a record and true for every subsequent one, so each duplicate is printed as it re-appears.