Wednesday, October 3, 2012

How to find duplicate records of a file in Linux?



How to find the duplicate records / lines from a file in Linux?
  Let us consider a file with the following contents. The duplicate record here is 'Linux'.
$ cat file
Unix
Linux
Solaris
AIX
Linux
Let us now see the different ways to find the duplicate record.

1. Using sort and uniq:
$ sort file | uniq -d
Linux
   uniq command has an option "-d" which prints only the duplicate records, one instance per duplicated record. sort command is used since uniq compares only adjacent lines and hence works reliably only on sorted input. uniq command without the "-d" option does the opposite: it removes the repeated copies and prints every record just once.
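   To also see how many times each record occurs, uniq's "-c" option can be combined with sort; on the sample file the output should look roughly like this:
$ sort file | uniq -c
      1 AIX
      2 Linux
      1 Solaris
      1 Unix
   Here the duplicate record 'Linux' shows up with a count of 2.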

2. awk way of fetching duplicate lines:
$ awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file
Linux
 Using awk's associative array, every record is stored as the array index and its value is the count of the number of times the record appears in the file. At the end, only those records whose count is more than 1, which indicates a duplicate, are printed.
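   If the duplicates should be reported as soon as they are encountered instead of in the END block, a shorter awk variant can be used; it prints a record only on its second occurrence, so the output follows the file order:
$ awk 'a[$0]++ == 1' file
Linux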

3. Using perl way:
$ perl -ne '$h{$_}++;END{foreach (keys%h){print $_ if $h{$_} > 1;}}' file
Linux
   This method is almost the same as the awk one above, the only difference being that a Perl hash is used to hold the counts.
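   As with awk, the END block can be dropped so that a duplicate is printed the moment it is seen for the second time; a minimal variant:
$ perl -ne 'print if $h{$_}++ == 1' file
Linux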

4. Another perl way:
$ perl -ne '$x=$_; if(grep {$_ eq $x} @arr){print;} else {push @arr,$x;}' file
Linux
  Every time a record is read, it is searched for in the array "arr". If present, the record is printed since it is a duplicate; otherwise the array is updated with this record. An exact string comparison (eq) is used instead of a regular expression match so that records containing special characters do not break the lookup. Note that a record occurring more than twice is printed on every occurrence after the first, unlike the earlier methods which report it only once.

5. A shell script to fetch / find duplicate records:
#!/bin/bash

# TEMP holds the records seen so far, TEMP1 collects the duplicates.
TEMP="temp"`date '+%d%m%Y%H%M%S'`
TEMP1="temp1"`date '+%d%m%Y%H%M%S'`
touch "$TEMP" "$TEMP1"

while IFS= read -r line
do
    # -x matches the whole line, -F treats the pattern as a fixed string
    if grep -qxF "$line" "$TEMP"
    then
             echo "$line" >> "$TEMP1"
    else
             echo "$line" >> "$TEMP"
    fi
done < "$1"

cat "$TEMP1"
\rm "$TEMP" "$TEMP1"
 Not the most efficient of solutions since it runs one grep per input line. The file is read in a while loop and two temporary files are maintained: TEMP holds the records seen so far and TEMP1 collects the duplicate records. Each line read is looked up in TEMP; if present, it is appended to TEMP1, else it is appended to TEMP. In this way, TEMP1 ends up containing only the duplicate records, which are then printed. The "-x" and "-F" options of grep ensure that only an exact, whole-line, literal match is treated as a duplicate.
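 Assuming the script is saved as, say, dup.sh (the name here is only for illustration) and made executable, it can be invoked with the file name as its argument:
$ chmod +x dup.sh
$ ./dup.sh file
Linux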

4 comments:

  1. This is very useful for learners, thank you

  2. Hi Friends,

    I have a file whose data is

    1,abc,123,abcd
    1,abc,123,ab
    cd
    1,efg,123,cdef

    I want the output below; how can I do this?

    1,abc,123,abcd
    1,abc,123,abcd
    1,efg,123,cdef
