Wednesday, October 3, 2012

How to find duplicate records of a file in Linux?



How to find the duplicate records / lines from a file in Linux?
  Let us consider a file with the following contents. The duplicate record here is 'Linux'.
$ cat file
Unix
Linux
Solaris
AIX
Linux
Let us now see the different ways to find the duplicate record.

1. Using sort and uniq:
$ sort file | uniq -d
Linux
   uniq command has an option "-d" which prints only the duplicate records, one instance per duplicated record. sort command is used since uniq compares only adjacent lines and hence works reliably only on sorted input. uniq command without the "-d" option does the opposite: it removes the repeated copies and prints every record just once.
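   To also see how many times each record occurs, uniq's "-c" option can be combined with sort; on the sample file the output should look roughly like this:
$ sort file | uniq -c
      1 AIX
      2 Linux
      1 Solaris
      1 Unix
   Here the duplicate record 'Linux' shows up with a count of 2.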

2. awk way of fetching duplicate lines:
$ awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' file
Linux
 Using awk's associative array, every record is stored as the array index and its value is the count of the number of times the record appears in the file. At the end, only those records whose count is more than 1, which indicates a duplicate, are printed.
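   If the duplicates should be reported as soon as they are encountered instead of in the END block, a shorter awk variant can be used; it prints a record only on its second occurrence, so the output follows the file order:
$ awk 'a[$0]++ == 1' file
Linux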

3. Using perl way:
$ perl -ne '$h{$_}++;END{foreach (keys%h){print $_ if $h{$_} > 1;}}' file
Linux
   This method is almost the same as the awk one above, the only difference being that a Perl hash is used to hold the counts.
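   As with awk, the END block can be dropped so that a duplicate is printed the moment it is seen for the second time; a minimal variant:
$ perl -ne 'print if $h{$_}++ == 1' file
Linux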

4. Another perl way:
$ perl -ne '$x=$_; if(grep {$_ eq $x} @arr){print;} else {push @arr,$x;}' file
Linux
  Every time a record is read, it is searched for in the array "arr". If present, the record is printed since it is a duplicate; otherwise the array is updated with this record. An exact string comparison (eq) is used instead of a regular expression match so that records containing special characters do not break the lookup. Note that a record occurring more than twice is printed on every occurrence after the first, unlike the earlier methods which report it only once.

5. A shell script to fetch / find duplicate records:
#!/bin/bash

# TEMP holds the records seen so far, TEMP1 collects the duplicates.
TEMP="temp"`date '+%d%m%Y%H%M%S'`
TEMP1="temp1"`date '+%d%m%Y%H%M%S'`
touch "$TEMP" "$TEMP1"

while IFS= read -r line
do
    # -x matches the whole line, -F treats the pattern as a fixed string
    if grep -qxF "$line" "$TEMP"
    then
             echo "$line" >> "$TEMP1"
    else
             echo "$line" >> "$TEMP"
    fi
done < "$1"

cat "$TEMP1"
\rm "$TEMP" "$TEMP1"
 Not the most efficient of solutions since it runs one grep per input line. The file is read in a while loop and two temporary files are maintained: TEMP holds the records seen so far and TEMP1 collects the duplicate records. Each line read is looked up in TEMP; if present, it is appended to TEMP1, else it is appended to TEMP. In this way, TEMP1 ends up containing only the duplicate records, which are then printed. The "-x" and "-F" options of grep ensure that only an exact, whole-line, literal match is treated as a duplicate.
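 Assuming the script is saved as, say, dup.sh (the name here is only for illustration) and made executable, it can be invoked with the file name as its argument:
$ chmod +x dup.sh
$ ./dup.sh file
Linux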

4 comments:

  1. This is very useful for learners, thank you

  2. Hi Friends,

    I have a file whose data is

    1,abc,123,abcd
    1,abc,123,ab
    cd
    1,efg,123,cdef

    I want the output below; how can I do this?

    1,abc,123,abcd
    1,abc,123,abcd
    1,efg,123,cdef
