17

I have a tool that generates tests and predicts the output. The idea is that if I have a failure I can compare the prediction to the actual output and see where they diverged. The problem is the actual output contains some lines twice, which confuses diff. I want to remove the duplicates, so that I can compare them easily. Basically, something like sort -u but without the sorting.
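
For example (the data here is invented), given actual output like:

b
a
b
c

sort -u would give a, b, c but loses the original order; what I want is:

b
a
c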

Is there any unix command line tool that can do this?

codeforester
Nathan Fellman

5 Answers

25

This is complementary to the uniq answers, which work great if you don't mind sorting your file first. If you need to remove non-adjacent duplicate lines (or if you want to remove duplicates without rearranging your file), the following Perl one-liner should do it (stolen from here):

cat textfile | perl -ne '$H{$_}++ or print'
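
As a quick sketch of what this does (the contents of textfile are invented here):

printf 'b\na\nb\nc\n' > textfile
cat textfile | perl -ne '$H{$_}++ or print'

which prints:

b
a
c

Each input line is used as a hash key: the print runs only the first time a given line is seen, so later duplicates are dropped and the original order is kept.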
Matt J
  • I think this is a neat answer. Been programming in Perl for about 6 years now and wouldn't have thought of something so concise – Xetius Apr 14 '09 at 08:21
  • The Perl part is really nifty. This does, however, qualify for the "Useless Use of cat" award :-) (see http://partmaps.org/era/unix/award.html). Just use "<". – sleske Apr 14 '09 at 08:23
  • I'd never heard of that award! Yeah, I do use cat rather gratuitously sometimes; I have no idea why "cat x | " looks any better than "< x" to me.. it just does :) It may have something to do with the fact that I very often redirect stdout as well, and "./prog < x > y" makes my eyes bleed :P – Matt J Apr 14 '09 at 13:28
  • Useless use of cat award! Use perl -ne ...whatever... textfile – Bklyn Apr 16 '09 at 03:16
  • To get only non-unique lines from an unsorted input, based on @MattJ's answer: `perl -ne '0==$H{$_}++ or print'`. Note that it will print the second occurrence - the first duplicate, that is. – Joel Purra Jul 11 '12 at 03:36
  • If you want to keep blank lines: `perl -ne 'if (/\S/) {$H{$_}++ or print} else {print}'`. – Tor Klingberg Jul 15 '16 at 09:46
  • This is a great answer, but did no one notice the OP's "but without the sorting" request? – bballdave025 Jun 02 '20 at 01:30
21

uniq(1)

SYNOPSIS

uniq [OPTION]... [INPUT [OUTPUT]]

DESCRIPTION

Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).
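
For instance (sample input invented), uniq by itself only collapses runs of adjacent identical lines:

printf 'a\na\nb\na\n' | uniq

prints:

a
b
a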

Or, if you want to remove non-adjacent duplicate lines as well, this fragment of perl will do it:

while(<>) {
    print $_ if (!$seen{$_});
    $seen{$_}=1;
}
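
Saved to a file, the fragment can be run on the offending output like this (the script name dedup.pl and the file names are just placeholders):

perl dedup.pl actual_output > actual_output.deduped

It also works as a filter on standard input, since while(<>) reads either the named files or stdin.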
The Archetypal Paul
  • The Perl answer only works if you want the first item. The last would be a different solution. – Xetius Apr 14 '09 at 08:22
  • And for those who don't know how to use Perl, this is all you need to type: perl -ne 'print unless $seen{$_}++' [INPUT] > OUTPUT – reinierpost Apr 14 '09 at 08:43
  • @Xetius, they're the same line :) If you do want the last line, just set the seen entry to the line number, don't print in the loop, and then print them out in order of line number at the end. But I don't think that's needed in this case. – The Archetypal Paul Apr 14 '09 at 09:51
  • @reinierpost, yep, I can never recall the command line options to do that so I tend to resort to full scripts... – The Archetypal Paul Apr 14 '09 at 09:51
3

Here is an awk implementation, in case the environment does not have / allow Perl (I haven't seen one yet!). PS: if a key occurs more than twice, it gets printed once for every repetition beyond the first.

awk '{

# Cut out the key on which duplicates are to be determined.
key = substr($0,2,14)

# If the key has not been seen before, store it in the array, else print it
if ( ! s[key] )
    s[key] = 1;
else
    print key;
}'
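
A quick sketch of its behaviour (the sample lines are invented; the key is assumed to occupy columns 2-15, matching the substr call above):

printf 'XABCDEFGHIJKLMN one\nXABCDEFGHIJKLMN two\n' | awk '{ key = substr($0,2,14); if (!s[key]) s[key] = 1; else print key }'

prints the duplicated key once:

ABCDEFGHIJKLMN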
Rishabh Sagar
  • If you're just looking at the entire line being the key, this is analogous to the perl solutions: `awk '!c[$0]++' file` – glenn jackman Jul 26 '11 at 15:03
1

If you are interested in removing adjacent duplicate lines, use uniq.

If you want to remove all duplicate lines, not just adjacent ones, then it's trickier.

C. K. Young
1

Here's what I came up with while I was waiting for an answer here (though the first (and accepted) answer came in after about 2 minutes). I used this substitution in VIM:

%s/^\(.*\)\n\1$/\1/

Which means: look for a line that is immediately followed by an identical line, and replace the pair with a single copy of the captured line.
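
As a small illustration (buffer contents made up), running the substitution on a buffer containing

foo
foo
bar

leaves

foo
bar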

uniq is definitely easier, though.

Nathan Fellman