If you can keep the lines in memory
If enough of the data will fit in memory, the awk solution by steve is pretty neat, whether you write to the sort command by pipe within awk or simply by piping the output of the unadorned awk to sort at the shell level.
If you have 100 GiB of data with perhaps 3% duplication, then you'll need to be able to store 100 GiB of data in memory. That's a lot of main memory. A 64-bit system might handle it with virtual memory, but it is likely to run rather slowly.
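For concreteness, here is a minimal sketch of that in-memory style of solution (my reconstruction, not necessarily steve's exact script; the choice of column 1 as the sort key is an assumption). It keeps the last record seen for each key and sorts the survivors at the end, since awk's for-in loop visits keys in no particular order:

# Collect the last record for each key in memory, then sort at the shell level
awk -F, '{ line[$1] = $0 } END { for (key in line) print line[key] }' x???.csv |
sort -t, -k1,1

# Or pipe to sort from within awk
awk -F, '{ line[$1] = $0 } END { for (key in line) print line[key] | "sort -t, -k1,1" }' x???.csv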
If the keys fit in memory
If you can't fit enough of the data in memory, then the task ahead is much harder and will require at least two scans over the files. We need to assume, pro tem, that you can at least fit each key in memory, along with a count of the number of times the key has appeared.
- Scan 1: read the files.
  - Count the number of times each key appears in the input.
  - In awk, use icount[$1]++ (a sketch follows below).
- Scan 2: reread the files.
  - Count the number of times each key has appeared; ocount[$1]++.
  - If icount[$1] == ocount[$1], then print the line.

(This assumes you can store the keys and counts twice; the alternative is to use icount (only) in both scans, incrementing in Scan 1 and decrementing in Scan 2, printing the value when the count decrements to zero.)
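For illustration (this is my sketch, not code from the question or the other answers), the two-hash scheme can be run as a single awk invocation by listing the files twice, with a pass marker in front of each copy of the list. The file order must be the same both times so that 'last occurrence' means the same thing in both scans:

awk -F, '
# Scan 1: count occurrences of each key
pass == 1 { icount[$1]++; next }
# Scan 2: print the last occurrence of each key
pass == 2 { if (++ocount[$1] == icount[$1]) print }
' pass=1 x???.csv pass=2 x???.csv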
I'd probably use Perl for this rather than awk, if only because it will be easier to reread the files in Perl than in awk.
Not even the keys fit?
What if you can't even fit the keys and their counts into memory? Then you are facing some serious problems, not least because scripting languages may not report the out-of-memory condition to you as cleanly as you'd like. I'm not going to attempt to cross this bridge until it's shown to be necessary. And if it is necessary, we'll need some statistical data on the file sets to know what might be possible:
- Average length of a record.
- Number of distinct keys.
- Number of distinct keys with N occurrences for each of N = 1, 2, ... max.
- Length of a key.
- Number of keys plus counts that can be fitted into memory.
And probably some others...so, as I said, let's not try crossing that bridge until it is shown to be necessary.
Perl solution
Example data
$ cat x000.csv
abc,123,def
abd,124,deg
abe,125,deh
$ cat x001.csv
abc,223,xef
bbd,224,xeg
bbe,225,xeh
$ cat x002.csv
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$ perl fixdupcsv.pl x???.csv
abd,124,deg
abe,125,deh
abc,223,xef
bbd,224,xeg
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$
Note the absence of gigabyte-scale testing!
fixdupcsv.pl
This uses the 'count up, count down' technique.
#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.

use strict;
use warnings;

# Scan 1 - count occurrences of each key
my %count;
my @ARGS = @ARGV;   # Preserve arguments for Scan 2

while (<>)
{
    $_ =~ /^([^,]+)/;
    $count{$1}++;
}

# Scan 2 - reread the files; count down occurrences of each key.
# Print when it reaches 0.
@ARGV = @ARGS;      # Reset arguments for Scan 2

while (<>)
{
    $_ =~ /^([^,]+)/;
    $count{$1}--;
    print if $count{$1} == 0;
}
The 'while (<>)' notation destroys @ARGV (hence the copy to @ARGS before doing anything else), but that also means that if you reset @ARGV to the original value, it will run through the files a second time. Tested with Perl 5.16.0 and 5.10.0 on Mac OS X 10.7.5.
This is Perl; TMTOWTDI. You could use:
#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.

use strict;
use warnings;

my %count;

sub counter
{
    my ($inc) = @_;
    while (<>)
    {
        $_ =~ /^([^,]+)/;
        $count{$1} += $inc;
        print if $count{$1} == 0;
    }
}

my @ARGS = @ARGV;   # Preserve arguments for Scan 2
counter(+1);
@ARGV = @ARGS;      # Reset arguments for Scan 2
counter(-1);
There are probably ways to compress the body of the loop, too, but I find what's there reasonably clear and prefer clarity over extreme terseness.
Invocation
You need to present the fixdupcsv.pl script with the file names in the correct order. Since you have files numbered from 1.csv through about 2000.csv, it is important to list them in numeric order rather than alphanumeric order. The other answers suggest ls -v *.csv, which uses the GNU ls extension option. If it is available, that's the best choice.
perl fixdupcsv.pl $(ls -v *.csv)
If that isn't available, then you need to do a numeric sort on the names:
perl fixdupcsv.pl $(ls *.csv | sort -t. -k1.1n)
Awk solution
awk -F, '
BEGIN {
    for (i = 1; i < ARGC; i++)
    {
        while ((getline < ARGV[i]) > 0)
            count[$1]++;
        close(ARGV[i]);
    }
    for (i = 1; i < ARGC; i++)
    {
        while ((getline < ARGV[i]) > 0)
        {
            count[$1]--;
            if (count[$1] == 0) print;
        }
        close(ARGV[i]);
    }
}'
This ignores awk's innate 'read' loop and does all reading explicitly (you could replace BEGIN by END and would get the same result). The logic is closely based on the Perl logic in many ways. Tested on Mac OS X 10.7.5 with both BSD awk and GNU awk. Interestingly, GNU awk insisted on the parentheses in the calls to close where BSD awk did not. The close() calls are necessary in the first loop to make the second loop work at all: without them, each getline in the second loop would pick up at the end-of-file position left by the first loop and read nothing. The close() calls in the second loop are there to preserve symmetry and for tidiness, but they might also be relevant when you get around to processing a few hundred files in a single run.