
I have a collection of ~10 million gzipped CSV files, each with anywhere from 100-1000 rows and >2000 columns. Each file also contains a header row.

In each CSV file there are two important columns, "ID" and "target".

I'm trying to remove rows with a duplicate "target", but append the ID of each removed row to the row that is kept.

E.g.

Input:

CSV1
|   ID  |  Target                      |
|-------|------------------------------|
| IX213 | C1=CC(=CC=C1CC(=O)C(=O)O)O   |
| IX412 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |

CSV2
|   ID  |  Target                      |
|-------|------------------------------|
| BC144 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| BC155 | C(CC(=O)O)C(C(=O)O)N         |

Output:

CSV1*
|   ID         |  Target                      |
|--------------|------------------------------|
| IX213        | C1=CC(=CC=C1CC(=O)C(=O)O)O   |
| IX412; BC144 | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |

CSV2*
|   ID  |  Target                      |
|-------|------------------------------|
| BC155 | C(CC(=O)O)C(C(=O)O)N         |

This would be straightforward for a small number of files with Pandas (Python) or the like, but I was hoping someone might have a much better way of doing it across millions of files with billions of entries.

StuckOnDiffyQ
  • Are you merging the files in pairs only, or do you need to merge all the files toward the first one? – Prune Jul 10 '19 at 21:51
  • Have you explored using a database for this? Assuming this is not the only operation you will be performing on this data, having it normalized and indexed for random access should offer quick and handsome paybacks in terms of convenience and execution speed (even if, like me, you don't particularly enjoy SQL). – tripleee Jul 11 '19 at 06:25
  • "Are you merging the files in pairs only, or do you need to merge all the files toward the first one?" De-duplicating across all entries and merging all duplicates into one. I have not tried using SQL for this, but you might be right that I should just bite the bullet and do it, since I'll be using this data more than once. – StuckOnDiffyQ Jul 11 '19 at 12:48

2 Answers


I would write a program that goes through a gzipped CSV file and writes out the following four columns: target, id, file, row.

Run this on each file, and you will get ~10 million small files.
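A minimal sketch of that first pass, assuming the columns are literally named "ID" and "Target" and that one tab-separated output line per row is acceptable (the script name and output format are illustrative, not prescribed by the answer):

import csv
import gzip
import sys

def extract(csv_gz_path, out):
    """First pass: emit one line per data row with target, id, file, row number."""
    with gzip.open(csv_gz_path, mode='rt', newline='') as f:
        reader = csv.DictReader(f)
        for row_number, row in enumerate(reader):
            # Zero-pad the row number (wide enough for the largest file) so that
            # a later text sort agrees with numeric order.
            out.write(f"{row['Target']}\t{row['ID']}\t{csv_gz_path}\t{row_number:06d}\n")

if __name__ == '__main__':
    # e.g. python extract.py some/file0001.csv.gz > some/file0001.tsv
    extract(sys.argv[1], sys.stdout)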

Assuming that you are NOT doing this in a distributed fashion, I would next combine the files into a single big file and sort it with the unix sort utility. (Warning: you want to run LC_ALL=C sort foo.txt, because the C locale is faster and produces more sensible results. See sort not sorting as expected (space and locale) for more.)

Now it is easy to process that file and decide which row to keep for each target. You can write out a file with the columns file row target id is_keep removed_ids. Be sure to write the row number with leading zeros, so instead of 42 you would write 000042. The removed_ids are the ones you removed from other files if you kept this one. (The number of leading zeros should be enough for your largest file; the padding makes asciibetical order match numerical order.)
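A sketch of that decision pass, assuming the concatenated output of the first pass has been sorted by target with LC_ALL=C sort, and taking "keep the first occurrence" as the policy (any deterministic rule would do):

import sys
from itertools import groupby

def decide(sorted_lines, out):
    """Read tab-separated target/id/file/row lines sorted by target; keep the
    first row per target and record which IDs were merged into it."""
    parsed = (line.rstrip('\n').split('\t') for line in sorted_lines)
    for target, group in groupby(parsed, key=lambda fields: fields[0]):
        rows = list(group)
        _, keep_id, keep_file, keep_row = rows[0]
        removed_ids = ';'.join(r[1] for r in rows[1:])
        out.write(f'{keep_file}\t{keep_row}\t{target}\t{keep_id}\t1\t{removed_ids}\n')
        for _, dup_id, dup_file, dup_row in rows[1:]:
            out.write(f'{dup_file}\t{dup_row}\t{target}\t{dup_id}\t0\t\n')

if __name__ == '__main__':
    # e.g. LC_ALL=C sort combined.tsv | python decide.py > decisions.tsv
    decide(sys.stdin, sys.stdout)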

Sort this file again, then break it up into one decisions file per original file.

Given an original gzipped file and its file of which rows to keep (and which IDs to record if you do keep them), it is easy to process your original files to remove/keep rows and record what you removed. I strongly suggest the sanity check of verifying that target/id/row all match, and don't delete the originals unless that sanity check passes.
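A sketch of that final rewrite for one original file, assuming its decision lines have already been split out into their own file and are keyed by row number (function names and the output path are illustrative):

import csv
import gzip

def load_decisions(decisions_path):
    """Map row number -> (target, id, keep?, removed_ids) for one original file."""
    decisions = {}
    with open(decisions_path) as f:
        for line in f:
            _, row, target, id_, is_keep, removed_ids = line.rstrip('\n').split('\t')
            decisions[int(row)] = (target, id_, is_keep == '1', removed_ids)
    return decisions

def rewrite(csv_gz_path, decisions_path, out_path):
    decisions = load_decisions(decisions_path)
    with gzip.open(csv_gz_path, 'rt', newline='') as f_in, \
         gzip.open(out_path, 'wt', newline='') as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row_number, row in enumerate(reader):
            target, id_, keep, removed_ids = decisions[row_number]
            # Sanity check: the decision must agree with the original row;
            # abort (and keep the original file) if it does not.
            assert (row['Target'], row['ID']) == (target, id_), (csv_gz_path, row_number)
            if keep:
                if removed_ids:
                    row['ID'] = '; '.join([row['ID']] + removed_ids.split(';'))
                writer.writerow(row)
            # Rows marked is_keep=0 are simply dropped.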


If you're a big fan of distributed processing, the translation from sorting to map-reduces is straightforward. If you have an infrastructure for that set up, you might as well use it. If not, I would suggest this sorted-file approach, using parallelization only to process the individual files at the first and last steps.

btilly
  • Exactly: use the tools that exist. There might even be a standard tool that will extract the fields. For example, `csvtool` that is available in Ubuntu. The only thing I'd change here is have the initial program append to a single file, rather than writing to individual files that I have to then combine. – Jim Mischel Jul 11 '19 at 16:28
  • @JimMischel I thought about single file vs multiple files, did a back of the envelope and found that we are dealing with multiple terabytes of data, and went with the approach that let us take advantage of the fact that the first and last steps are embarrassingly parallel. – btilly Jul 11 '19 at 19:25

Even though the amount of data may look overwhelming, I think you could iterate over all files sequentially if you keep just the right data in memory. For instance, you could track a mapping from each unique target to the first ID seen with that target, plus a mapping of ID aliases (e.g. ID BC144 is an alias of IX412). This way your solution may look something like this:

import csv
import gzip

filenames = [...]
target_ids = {}  # Target -> first ID seen with that target
aliases = {}     # removed ID -> ID of the row it was merged into

for filename in filenames:
    with gzip.open(filename, 'rt', newline='') as file_in:  # files are gzipped
        reader = csv.DictReader(file_in)
        for row in reader:
            if row['Target'] in target_ids:
                aliases[row['ID']] = target_ids[row['Target']]
                remove_row(row)  # Placeholder: do whatever you may require
            else:
                target_ids[row['Target']] = row['ID']

Note that having a dict with ~10M key-value pairs is perfectly tractable.

If this still doesn't fit in memory, you could use shelve instead of dicts so that the corresponding data is stored on disk. You could do something like:

import csv
import gzip
import shelve

filenames = [...]

with shelve.open('target_ids') as target_ids, shelve.open('aliases') as aliases:
    for filename in filenames:
        with gzip.open(filename, 'rt', newline='') as file_in:
            reader = csv.DictReader(file_in)
            for row in reader:
                if row['Target'] in target_ids:
                    aliases[row['ID']] = target_ids[row['Target']]
                    remove_row(row)  # Placeholder: do whatever you may require
                else:
                    target_ids[row['Target']] = row['ID']

The downside of shelve compared to a regular dict is speed.
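If you also need to produce the merged output shown in the question (e.g. "IX412; BC144"), a second pass over the files can invert the aliases mapping; a rough sketch under the same assumptions (gzipped input, columns named "ID" and "Target", the output file naming is illustrative):

import csv
import gzip
from collections import defaultdict

def write_merged(filenames, aliases, suffix='_dedup.csv.gz'):
    """Second pass: drop rows whose target was already seen elsewhere and
    append the removed IDs to the ID of the row that was kept."""
    # Invert aliases: kept ID -> list of IDs that were merged into it.
    merged = defaultdict(list)
    for removed_id, kept_id in aliases.items():
        merged[kept_id].append(removed_id)

    for filename in filenames:
        with gzip.open(filename, 'rt', newline='') as f_in, \
             gzip.open(filename + suffix, 'wt', newline='') as f_out:
            reader = csv.DictReader(f_in)
            writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                if row['ID'] in aliases:
                    continue  # duplicate target, row is dropped
                if row['ID'] in merged:
                    row['ID'] = '; '.join([row['ID']] + merged[row['ID']])
                writer.writerow(row)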

josoler
  • You need to keep on the order of 2³² IDs, each 6 or 7 characters minimum if keeping to *single case letters+digits* - about 60 GB. Plus about as many rows of `>2000 columns` as there are "target" values. Steep for a single node in 2019. – greybeard Jul 11 '19 at 03:04
  • @greybeard Note that we don't need to keep track of the full rows with this approach. Worst case scenario (no repeated targets at all): We'll have ~10M * (100 to 1000) ID-Target pairs. Still it may not fit in memory, if so, I would just replace dicts with shelves – josoler Jul 11 '19 at 06:25
  • `we don't need to keep track of the full rows` I read `remove the rows with duplicate "target"` as *keep one full row per "target"* - "we" could create the required output in a second pass, writing all IDs for each "target" encountered and the remainder of the row *if the "target" is in the lookup*, removing it immediately after. – greybeard Jul 11 '19 at 08:44
  • Can give this a try; the absolute maximum should be 2.1B key pairs assuming nothing is duplicated, but a realistic estimate is more like ~1.3B. – StuckOnDiffyQ Jul 11 '19 at 13:19