I need to remove all lines that occur more than once in a file.
Example:
Line1
Line2
Line3
Line2
Result:
Line1
Line3
Python, Perl, or a Unix utility, it doesn't matter. Thank you.
Preserves order, but keeps two copies of the file in memory:
my @lines;
my %seen;
while (<>) {
    push @lines, $_;
    ++$seen{$_};
}
for (@lines) {
    print if $seen{$_} == 1;
}
As a one-liner:
perl -ne'push @l, $_; ++$s{$_}; }{ for (@l) { print if $s{$_} == 1; }'
Doesn't preserve order, but keeps only one copy of the file in memory:
my %seen;
++$seen{$_} while <>;
while (my ($k, $v) = each(%seen)) {
    print $k if $v == 1;
}
As a one-liner:
perl -ne'++$s{$_}; }{ while (my ($k, $v) = each(%s)) { print $k if $v == 1; }'
Here's a Python implementation.
If you need to preserve the initial order of the lines:
import collections
import fileinput
lines = list(fileinput.input())
counts = collections.Counter(lines)
print(''.join(line for line in lines if counts[line] == 1))
If not, it's a tiny bit simpler and faster:
import collections
import fileinput
counts = collections.Counter(fileinput.input())
print(''.join(line for line, count in counts.items() if count == 1))
For each line, you need to see if it has any dups. If you don't want to do this quadratically (doing one pass, and then a second pass for each line), you need to use an intermediate data structure that allows you to do it in two linear passes.
So, you make a pass through the list to build a hash table (collections.Counter is a specialized dict that just maps each key to the number of times it appears). Then, you can either make a second pass through the list, looking each line up in the hash table (first version), or just iterate the hash table (second version).
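If the Counter part is the only unfamiliar bit, here is a minimal sketch of the same two-pass idea using a plain dict; the sys.argv filename handling is my own addition, not part of the answer above:
import sys
# First linear pass: build a hash table mapping each line to its count.
with open(sys.argv[1]) as f:
    lines = f.readlines()
counts = {}
for line in lines:
    counts[line] = counts.get(line, 0) + 1
# Second linear pass: keep only the lines that appeared exactly once.
sys.stdout.write(''.join(line for line in lines if counts[line] == 1))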
As far as I know, there's no way to do the equivalent with command-line tools; you will at least have to sort the input (which is O(N log N) instead of O(N)), or use a tool that implicitly does the equivalent.
But for many use cases, that's not a big deal. For an 80MB file with 1M lines, N log N is only an order of magnitude slower than N (log2 of 1M is about 20), and it's perfectly conceivable that the constant-multiplier difference between two tools will be on the same order.
A quick timing test verifies that: on the scale of 1M lines, the sort | uniq -u version is just over 6x slower, but still fast enough that you probably won't care (under 10 seconds, barely more time than it would take to copy and paste the Python code, right?) unless you have to do this repeatedly.
From further tests, at 128K lines, the Python version is only 4x faster; at 64M lines, it's 28x faster; at 5G lines… both versions drive the system into swap thrashing badly enough that I killed the tests. (Replacing the Counter with a dbm key-value database solves that problem, but at a huge cost for smaller scales.)
The *nix command uniq can do this.
sort file.name | uniq -u
Here's an example in perl:
my %line_hash;
open my $fh, "<", "testfile" or die "Can't open testfile: $!";
while (my $line = <$fh>) {
    $line_hash{$line}++;
}
close $fh;
open my $out_fh, ">", "outfile" or die "Can't open outfile: $!";
for my $key (sort keys %line_hash) {
    print $out_fh $key if $line_hash{$key} == 1;
}
close $out_fh;
testfile:
$ cat testfile
Line1
Line2
Line3
Line2
outfile:
$ cat outfile
Line1
Line3
sort inputfile | uniq -u
(assuming GNU coreutils uniq)
Though SUSv4 says:
-u Suppress the writing of lines that are repeated in the input.
it sounds, from comments on some answers, that not all uniq implementations interpret that the same way.
Read each line, grep for it in the same file to find how many times it occurs, and only print the ones where the count is 1:
#!/bin/bash
# Note: this runs one grep per input line, so it is quadratic in the file size.
while IFS= read -r line
do
    # -x matches whole lines only; -F treats the line as a fixed string, not a regex.
    if [ "$(grep -c -x -F -- "${line}" sample.txt)" -eq 1 ]; then
        echo "${line}"
    fi
done < sample.txt