0

I need to remove all lines that occur more than once in a file.

Example:

Line1
Line2
Line3
Line2

Result:

Line1
Line3

Python, Perl or unix-util, doesn't matter. Thank you.

tor11
  • 43
  • 6
  • Do you need to preserve the original order? – abarnert Apr 30 '13 at 17:05
  • This question has already been answered a couple of times; check [this](http://stackoverflow.com/questions/1215208/how-might-i-remove-duplicate-lines-from-a-file). – João Pereira Apr 30 '13 at 17:06
  • No the order is not important. – tor11 Apr 30 '13 at 17:08
  • @Hal: This isn't quite the same, because he wants to remove _all_ copies of the line, not all but the first. None of the solutions there will do that, and it's not obvious how to modify them to do so. – abarnert Apr 30 '13 at 17:14

6 Answers

4

Preserves order, but keeps two copies of the file in memory:

my @lines;
my %seen;
while (<>) {
   push @lines, $_;
   ++$seen{$_};
}

for (@lines) {
   print if $seen{$_} == 1;
}

As a one-liner:

perl -ne'push @l, $_; ++$s{$_}; }{ for (@l) { print if $s{$_} == 1; }'

Doesn't preserve order, but keeps only one copy of the file in memory:

my %seen;
++$seen{$_} while <>;

while (my ($k, $v) = each(%seen)) {
   print $k if $v == 1;
}

As a one-liner:

perl -ne'++$s{$_}; }{ while (my ($k, $v) = each(%s)) { print $k if $v == 1; }'
amon
  • 57,091
  • 2
  • 89
  • 149
ikegami
  • 367,544
  • 15
  • 269
  • 518
  • From a quick test, these seem to work. Scanning the code, they seem functionally identical to my two Python implementations: around the same size, +/- 50% speed, similar memory usage. And presumably it would be a lot more readable to someone who's familiar with Perl but not Python. So… +1. You might want to read the other Perl answer, because I'm not sure if it's the same. (I long ago deliberately used electrodes to burn out the part of my brain that reads Perl natively.) – abarnert Apr 30 '13 at 18:07
  • @abarnert, Yeah, it's actually pretty clean Perl code, but Perl looks a fair bit different than other languages, so it can be hard to read to an outsider. /// I saw the other Perl answer. It uses as much memory as my order-preserving solution, but it doesn't preserve order. (It also sorts, though it's only for aesthetic reasons, so it can be removed if performance is a problem.) – ikegami Apr 30 '13 at 19:43
2

Here are two Python implementations.

If you need to preserve the initial order of the lines:

import collections
import fileinput

lines = list(fileinput.input())
counts = collections.Counter(lines)
print(''.join(line for line in lines if counts[line] == 1))

If not, it's a tiny bit simpler and faster:

import collections
import fileinput

counts = collections.Counter(fileinput.input())
print(''.join(line for line, count in counts.items() if count == 1))

For each line, you need to see if it has any dups. If you don't want to do this quadratically (rescanning the whole file once for every line), you need an intermediate data structure that allows you to do it in two linear passes.

So, you make a pass through the list to build a hash table (collections.Counter is a specialized dict that just maps each key to the number of times it appears). Then, you can either make a second pass through the list, looking each one up in the hash table (first version), or just iterate the hash table (second).
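
For illustration, here's roughly the same two-pass idea spelled out with a plain dict instead of collections.Counter; reading the file name from the command line is just an assumption for the sketch:

import sys

# Pass 1: read the whole file and count how many times each line appears.
lines = open(sys.argv[1]).readlines()
counts = {}
for line in lines:
    counts[line] = counts.get(line, 0) + 1

# Pass 2: keep only the lines whose count is exactly 1, preserving order.
sys.stdout.write(''.join(line for line in lines if counts[line] == 1))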


As far as I know, there's no way to do the equivalent with command-line tools; you will at least have to sort the input (which is O(N log N), instead of O(N)), or use a tool that implicitly does the equivalent.

But for many use cases, that's not a big deal. For an 80MB file with 1M lines, N log N is only an order of magnitude slower than N, and it's perfectly conceivable that the constant-multiplier difference between two tools will be on the same order.


A quick timing test verifies that, on the scale of 1M lines, the sort | uniq -u version is just over 6x slower, but still fast enough that you probably won't care (under 10 seconds, which is more time than it would take to copy and paste the Python code, right?) unless you have to do this repeatedly.
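
If you want to reproduce that kind of comparison, a rough harness along these lines does the job; the input file and the name of the Python script are made-up placeholders:

import subprocess
import time

def time_cmd(cmd):
    # Time one shell pipeline, discarding its output.
    start = time.perf_counter()
    subprocess.run(cmd, shell=True, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start

print('sort | uniq -u :', time_cmd('sort big_input.txt | uniq -u'))
print('Python version :', time_cmd('python3 unique_lines.py big_input.txt'))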

From further tests, at 128K lines, the Python version is only 4x faster; at 64M lines, it's 28x faster; at 5G lines… both versions drive the system into swap thrashing badly enough that I killed the tests. (Replacing the Counter with a dbm key-value database solves that problem, but at a huge cost for smaller scales.)
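
A minimal sketch of that dbm idea, with made-up file names: the counts live on disk instead of in RAM, so huge inputs stop thrashing swap, at the cost of much slower lookups.

import dbm

with dbm.open('line_counts.db', 'n') as counts:
    # Pass 1: count every line; dbm keys and values must be bytes.
    with open('big_input.txt', 'rb') as f:
        for line in f:
            counts[line] = str(int(counts.get(line, b'0')) + 1).encode()
    # Pass 2: write out the lines seen exactly once (order is not preserved).
    with open('only_once.txt', 'wb') as out:
        for key in counts.keys():
            if counts[key] == b'1':
                out.write(key)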

abarnert
  • 354,177
  • 51
  • 601
  • 671
1

The *nix command uniq can do this.

sort file.name | uniq -u
Mike Gardner
  • 6,611
  • 5
  • 24
  • 34
  • No it can't. This will include the first occurrence of `Line2`, which the OP explicitly doesn't want. And even if he didn't want that, `uniq` requires the data to be sorted, which it isn't. – abarnert Apr 30 '13 at 17:14
  • Adding `-u` doesn't fix anything. That's just the default (and opposite of `-d`). Did you even try this? – abarnert Apr 30 '13 at 17:18
  • @abarnert: perhaps you have a less capable uniq – ysth Apr 30 '13 at 17:19
  • @ysth: I have a `uniq` that follows the POSIX spec. See [here](http://www.unix.com/man-page/POSIX/1posix/uniq/). – abarnert Apr 30 '13 at 17:20
  • @abarnert: and did you try it? I can see "Suppress the writing of lines that are repeated in the input." meaning two different things; my uniq (from coreutils 8.13) certainly does what was requested. – ysth Apr 30 '13 at 17:24
  • @ysth: I tried it without the `sort` (which Michael Gardner left off his answer), and may have wrongly assumed the reason it didn't work. – abarnert Apr 30 '13 at 17:26
  • @ysth It didn't work on my RH Linux or Solaris 10 systems until I used *sort*. I think there are some issues in many distros. – Mike Gardner Apr 30 '13 at 17:29
  • I don't think it's _supposed_ to work without `sort`. The POSIX spec, and the GNU, BSD, and Sun manpages, are all a little ambiguous… but I can't find any system where it _does_ work without `sort`. But it seems to work on all of them _with_ `sort`. I think you're supposed to assume that the "scans adjacent lines" is a constraint that's true no matter what the flags are. – abarnert Apr 30 '13 at 17:30
  • At any rate, this is obviously O(N log N) instead of just O(N), but, as I pointed out in my answer, that may not be a problem for many use cases. – abarnert Apr 30 '13 at 17:30
1

Here's an example in Perl:

my %line_hash;
open my $fh, "<", "testfile";
while(my $line = <$fh>) {
   $line_hash{$line}++; 
}
close $fh;

open my $out_fh, ">>", "outfile";
for my $key ( sort keys %line_hash ){
    print $out_fh $key if $line_hash{$key} == 1;
}
close $out_fh;

testfile:

$ cat testfile
Line1
Line2
Line3
Line2

outfile:

$ cat outfile
Line1
Line3
chrsblck
  • 3,948
  • 2
  • 17
  • 20
0

sort inputfile | uniq -u

(assuming GNU coreutils uniq)

Though SUSv4 says:

-u Suppress the writing of lines that are repeated in the input.

it sounds from the comments on some answers that not all uniq implementations interpret that the same way.

ysth
  • 96,171
  • 6
  • 121
  • 214
  • Thank you all, I tried your solutions; this one seems to be the easiest and works for me (Debian 2.6.32). – tor11 Apr 30 '13 at 17:37
-1

Read each line, grep for it in the same file to get its count, and only print the lines whose count is 1:

#!/bin/bash
# Quote the line and use fixed-string, whole-line matching so lines with
# spaces, regex metacharacters, or leading dashes are counted correctly.
while IFS= read -r line
do
  if [ "$(grep -cxF -- "$line" sample.txt)" -eq 1 ] ; then echo "$line" ; fi
done < sample.txt
Bill
  • 304
  • 1
  • 4
  • This works, but it's quadratic: it rereads the entire file once per line. This is a really bad idea for large files. – abarnert Apr 30 '13 at 17:21
  • ...done < `uniq sample.txt` – Bill Apr 30 '13 at 17:35
  • `uniq` won't have any useful effect without `sort`—and, even with it, it's still a quadratic solution, and it doesn't add anything on top of what `sort` and `uniq` already give you on their own. – abarnert Apr 30 '13 at 17:36
  • which requires sort still... – Bill Apr 30 '13 at 17:48