If you can keep the lines in memory
If enough of the data will fit in memory, the awk solution by steve is pretty neat, whether you write to the sort command by pipe within awk or simply by piping the output of the unadorned awk to sort at the shell level.
If you have 100 GiB of data with perhaps 3% duplication, then you'll need to be able to store 100 GiB of data in memory. That's a lot of main memory. A 64-bit system might handle it with virtual memory, but it is likely to run rather slowly.
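For concreteness, here is a minimal sketch of that in-memory style of solution (my reconstruction, not necessarily steve's exact script; the choice of column 1 as the sort key is an assumption). It keeps the last record seen for each key and sorts the survivors at the end, since awk's for-in loop visits keys in no particular order:

# Collect the last record for each key in memory, then sort at the shell level
awk -F, '{ line[$1] = $0 } END { for (key in line) print line[key] }' x???.csv |
sort -t, -k1,1

# Or pipe to sort from within awk
awk -F, '{ line[$1] = $0 } END { for (key in line) print line[key] | "sort -t, -k1,1" }' x???.csv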
If the keys fit in memory
If you can't fit enough of the data in memory, then the task ahead is much harder and will require at least two scans over the files. We need to assume, pro tem, that you can at least fit each key in memory, along with a count of the number of times the key has appeared.
- Scan 1: read the files.
  - Count the number of times each key appears in the input.
  - In awk, use icount[$1]++ (a sketch follows below).
- Scan 2: reread the files.
  - Count the number of times each key has appeared; ocount[$1]++.
  - If icount[$1] == ocount[$1], then print the line.

(This assumes you can store the keys and counts twice; the alternative is to use icount (only) in both scans, incrementing in Scan 1 and decrementing in Scan 2, printing the value when the count decrements to zero.)
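For illustration (this is my sketch, not code from the question or the other answers), the two-hash scheme can be run as a single awk invocation by listing the files twice, with a pass marker in front of each copy of the list. The file order must be the same both times so that 'last occurrence' means the same thing in both scans:

awk -F, '
# Scan 1: count occurrences of each key
pass == 1 { icount[$1]++; next }
# Scan 2: print the last occurrence of each key
pass == 2 { if (++ocount[$1] == icount[$1]) print }
' pass=1 x???.csv pass=2 x???.csv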
I'd probably use Perl for this rather than awk, if only because it will be easier to reread the files in Perl than in awk.
Not even the keys fit?
What if you can't even fit the keys and their counts into memory? Then you are facing some serious problems, not least because scripting languages may not report the out-of-memory condition to you as cleanly as you'd like. I'm not going to attempt to cross this bridge until it's shown to be necessary. And if it is necessary, we'll need some statistical data on the file sets to know what might be possible:
- Average length of a record.
- Number of distinct keys.
- Number of distinct keys with N occurrences for each of N = 1, 2, ... max.
- Length of a key.
- Number of keys plus counts that can be fitted into memory.
And probably some others...so, as I said, let's not try crossing that bridge until it is shown to be necessary.
Perl solution
Example data
$ cat x000.csv
abc,123,def
abd,124,deg
abe,125,deh
$ cat x001.csv
abc,223,xef
bbd,224,xeg
bbe,225,xeh
$ cat x002.csv
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$ perl fixdupcsv.pl x???.csv
abd,124,deg
abe,125,deh
abc,223,xef
bbd,224,xeg
cbc,323,zef
cbd,324,zeg
bbe,325,zeh
$
Note the absence of gigabyte-scale testing!
fixdupcsv.pl
This uses the 'count up, count down' technique.
#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.

use strict;
use warnings;

# Scan 1 - count occurrences of each key
my %count;
my @ARGS = @ARGV;   # Preserve arguments for Scan 2

while (<>)
{
    $_ =~ /^([^,]+)/;
    $count{$1}++;
}

# Scan 2 - reread the files; count down occurrences of each key.
# Print when it reaches 0.
@ARGV = @ARGS;      # Reset arguments for Scan 2

while (<>)
{
    $_ =~ /^([^,]+)/;
    $count{$1}--;
    print if $count{$1} == 0;
}
The 'while (<>)' notation destroys @ARGV (hence the copy to @ARGS before doing anything else), but that also means that if you reset @ARGV to the original value, it will run through the files a second time. Tested with Perl 5.16.0 and 5.10.0 on Mac OS X 10.7.5.
This is Perl; TMTOWTDI. You could use:
#!/usr/bin/env perl
#
# Eliminate duplicate records from 100 GiB of CSV files based on key in column 1.

use strict;
use warnings;

my %count;

sub counter
{
    my ($inc) = @_;
    while (<>)
    {
        $_ =~ /^([^,]+)/;
        $count{$1} += $inc;
        print if $count{$1} == 0;
    }
}

my @ARGS = @ARGV;   # Preserve arguments for Scan 2
counter(+1);
@ARGV = @ARGS;      # Reset arguments for Scan 2
counter(-1);
There are probably ways to compress the body of the loop, too, but I find what's there reasonably clear and prefer clarity over extreme terseness.
Invocation
You need to present the fixdupcsv.pl script with the file names in the correct order. Since you have files numbered from 1.csv through about 2000.csv, it is important to list them in numeric order rather than alphanumeric order. The other answers suggest ls -v *.csv, which uses the GNU ls extension option. If it is available, that's the best choice.
perl fixdupcsv.pl $(ls -v *.csv)
If that isn't available, then you need to do a numeric sort on the names:
perl fixdupcsv.pl $(ls *.csv | sort -t. -k1.1n)
Awk solution
awk -F, '
BEGIN {
    for (i = 1; i < ARGC; i++)
    {
        while ((getline < ARGV[i]) > 0)
            count[$1]++;
        close(ARGV[i]);
    }
    for (i = 1; i < ARGC; i++)
    {
        while ((getline < ARGV[i]) > 0)
        {
            count[$1]--;
            if (count[$1] == 0) print;
        }
        close(ARGV[i]);
    }
}'
This ignores awk's innate 'read' loop and does all reading explicitly (you could replace BEGIN by END and would get the same result). The logic is closely based on the Perl logic in many ways. Tested on Mac OS X 10.7.5 with both BSD awk and GNU awk. Interestingly, GNU awk insisted on the parentheses in the calls to close where BSD awk did not. The close() calls are necessary in the first loop to make the second loop work at all: without them, each getline in the second loop would pick up at the end-of-file position left by the first loop and read nothing. The close() calls in the second loop are there to preserve symmetry and for tidiness, but they might also be relevant when you get around to processing a few hundred files in a single run.