
I want to remove duplicate entries from a text file, e.g.:

kavitha= Tue Feb    20 14:00 19 IST 2012  (duplicate entry) 
sree=Tue Jan  20 14:05 19 IST 2012  
divya = Tue Jan  20 14:20 19 IST 2012  
anusha=Tue Jan 20 14:45 19 IST 2012 
kavitha= Tue Feb    20 14:00 19 IST 2012 (duplicate entry) 

Is there any possible way to remove the duplicate entries using a Bash script?

Desired output

kavitha= Tue Feb    20 14:00 19 IST 2012 
sree=Tue Jan  20 14:05 19 IST 2012  
divya = Tue Jan  20 14:20 19 IST 2012  
anusha=Tue Jan 20 14:45 19 IST 2012
– divz

4 Answers


You can sort and remove duplicates in one step with `sort -u` (equivalent to `sort | uniq`):

$ sort -u input.txt

Or use awk:

$ awk '!a[$0]++' input.txt
– kev
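As an aside (not from the original answer): the `awk` one-liner keeps a count of how many times each whole line (`$0`) has been seen, and the pattern `!a[$0]++` is true only on a line's first occurrence, so each line is printed once and, unlike `sort -u`, the original order is preserved. Written out long-hand it is roughly:

$ awk '{ if (a[$0] == 0) print $0; a[$0]++ }' input.txt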
  • Testing with an 18,500 line text file: `sort ...` takes about 0.57s whereas `awk ...` takes about 0.08s because `awk ...` just removes duplicates without sorting. – Hugo Oct 19 '13 at 12:38
  • @Hugo I can second that. Testing against 2,626,198 lines `awk` beats `sort`. Results show `awk` taking 5.675s and `sort` taking 5.675s. Interestingly enough the same record set took 15.1 seconds to perform a MySQL DISTINCT query on. – Tegan Snyder Feb 11 '16 at 19:13
  • @TeganSnyder You wrote that both commands took exactly the same time to execute. Didn't `awk` take less time? – jarno May 17 '16 at 09:59
  • @jarno - my apologies, that was a copy-paste error on my part. I would need to recreate the test to see how much faster `awk` was, but the difference was negligible. – Tegan Snyder May 17 '16 at 14:40
  • @Hugo Is there an elegant way to make this work case-insensitively? Or is it better to just convert the entire doc to lowercase, then run this? – Onichan Jun 09 '16 at 02:55
  • @Onichan Try something like this: `echo -e "c\nb\nB\na" | LC_COLLATE=C sort -uf` http://superuser.com/q/178171/83235 – Hugo Jun 09 '16 at 07:46
  • Tested with 24 million rows, `awk` did not come to a result within 20 minutes; `sort` + `uniq` did the job in a few seconds. – bhelm Jul 04 '16 at 15:12
  • I downvoted this because, although the poster is happy, folks could be confused by an answer that does not yield the desired output, as it sorts the input. – lab419 Dec 17 '17 at 16:54
  • Tried it like this: `for i in $(ls .*ini); do awk '!a[$0]++' $i; done` and it erased some files entirely. – CIsForCookies Apr 04 '19 at 16:38
  • @bhelm Re: "24 million rows": any ideas _why_ awk did not come to a result within 20 minutes? – pmor May 15 '23 at 19:12
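For the case-insensitivity question raised in the comments above, one option (a sketch, not from the original answers) is to fold case in the array key, so duplicates are detected regardless of case while the first-seen spelling and the line order are kept:

$ awk '!a[tolower($0)]++' input.txt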

This deletes duplicate, consecutive lines from a file (emulating `uniq`).
The first line in a set of duplicate lines is kept; the rest are deleted.

sed '$!N; /^\(.*\)\n\1$/!P; D'
– Siva Charan
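Note that, like `uniq`, this removes only duplicates that are adjacent; for input like the question's, where the repeated `kavitha` line is not next to its twin, you would have to sort first, for example (a sketch, at the cost of changing the line order):

sort input.txt | sed '$!N; /^\(.*\)\n\1$/!P; D'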

Perl one-liner similar to @kev's awk solution:

perl -ne 'print if ! $a{$_}++' input

This variation removes trailing whitespace before comparing:

perl -lne 's/\s*$//; print if ! $a{$_}++' input

This variation edits the file in-place:

perl -i -ne 'print if ! $a{$_}++' input

This variation edits the file in-place and makes a backup, `input.bak`:

perl -i.bak -ne 'print if ! $a{$_}++' input
– Chris Koknat
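Usage is the same in each case; for example, to deduplicate in place and then see which lines were dropped (a sketch, assuming the input file is called `input` as above):

perl -i.bak -ne 'print if ! $a{$_}++' input
diff input.bak input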
    I like the Perl solution because it allows me to add extra conditions, e.g. only enforce uniqueness on lines matching a certain pattern. – Capt. Crunch Oct 11 '18 at 04:10
  • Is `perl -i -ne 'print if ! $a{$_}++' input` faster (in general) than `gawk -i inplace '!a[$0]++' input`? – pmor May 16 '23 at 18:27

This might work for you:

cat -n file.txt |                           # number the lines so the original order can be restored
sort -u -k2,7 |                             # keep one copy of each entry, comparing fields 2-7 (skipping the line number)
sort -n |                                   # put the surviving lines back into their original order
sed 's/.*\t/    /;s/\([0-9]\{4\}\).*/\1/'   # replace the line number with an indent and drop text after the 4-digit year

or this:

 awk '{line=substr($0,1,match($0,/[0-9][0-9][0-9][0-9]/)+3);sub(/^/,"    ",line);if(!dup[line]++)print line}' file.txt
– potong
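The second one-liner, spread over several lines for readability (a sketch; the logic is unchanged):

awk '{
  # keep the text up to and including the first run of four digits (the year)
  line = substr($0, 1, match($0, /[0-9][0-9][0-9][0-9]/) + 3)
  # prepend four spaces, as the sed command in the pipeline above does
  sub(/^/, "    ", line)
  # print only the first occurrence of each resulting line
  if (!dup[line]++) print line
}' file.txt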