19

I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know there are potentially a few hundred thousand duplicate rows. I ran this code to remove them:

import pandas

concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)

However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more efficient way of removing duplicates, perhaps one that works in chunks, without me having to break the CSV file up into smaller files?

Edward
Vlad
  • Possible duplicate of ["Large data" work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – kur ag Sep 19 '18 at 13:54
  • You might have some luck with [dask](http://dask.pydata.org/en/latest/) – roganjosh Sep 19 '18 at 13:54
  • 1
    If you have Linux, this works well : cat Concat\ EPC3.csv | sort | uniq (be careful with the header if you have one) – Corentin Limier Sep 19 '18 at 13:55
  • 1
    chunking the df in this situation won't be enough on its own: it would remove duplicates that occur within a chunk, but duplicates that span multiple chunks won't be removed (only the duplicates within each chunk will be). – d_kennetz Sep 19 '18 at 13:57
  • 3
    One thing I would recommend doing is seeing if you can decrease the memory of the DF by specifying the dtype of each column. Pandas tries to guess the datatype up front: sometimes you think a column may be purely int or float, but pandas may assign it as object, or there may be a repetitive column stored as object that could be assigned a category. Object-to-float or object-to-int conversion drastically decreases memory, and so does object-to-category. – d_kennetz Sep 19 '18 at 14:01 (a sketch combining this with chunked reading follows these comments)
  • @CorentinLimier don't promote a [useless use of `cat`](http://porkmail.org/era/unix/award.html). Your pipeline is "equivalent but worse" to `sort -u `. Anyway since it is a `csv` I'd first try to use `csvsort` instead... – Giacomo Alzetta Sep 19 '18 at 14:08
  • @CorentinLimier no Linux unfortunately – Vlad Sep 19 '18 at 14:10
  • @d_kennetz I will try that, and let you know if it works. Thanks – Vlad Sep 19 '18 at 14:10
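
Building on the comments above, a minimal sketch of a chunked approach that also handles d_kennetz's caveat about duplicates spanning chunks might look like the following. It reads the file with `chunksize`, deduplicates each chunk locally, and keeps a set of per-row hashes across chunks so that a row already written by an earlier chunk is skipped. The `dtype` mapping is a placeholder (the real column names and types depend on the file), and `pandas.util.hash_pandas_object` is used for the row hashes, which accepts a vanishingly small chance of a hash collision dropping a unique row.

import pandas

# Placeholder dtypes -- the real column names and types depend on the file.
# Columns with few distinct values can be read as 'category' to cut memory further.
dtypes = {"SomeIntColumn": "int32", "SomeRepetitiveColumn": "category"}

seen = set()   # uint64 row hashes already written, shared across all chunks
first = True

for chunk in pandas.read_csv("C:\\OUT\\Concat EPC3.csv", dtype=dtypes,
                             chunksize=1_000_000):
    chunk = chunk.drop_duplicates()  # duplicates within the chunk
    row_hashes = pandas.util.hash_pandas_object(chunk, index=False)
    keep = [h not in seen for h in row_hashes]  # duplicates across chunks
    seen.update(row_hashes)
    chunk[keep].to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False,
                       mode="w" if first else "a", header=first)
    first = False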

3 Answers

12

The simplest solution would be to compute a hash for each line in the file and keep the hashes you have seen in a set; storing 16M hashes in working memory shouldn't be a problem (though it depends on the hash size). As you go through the file, you then write out a line only the first time its hash appears. You don't even need to parse the CSV, nor do you need pandas.

import hashlib

with open("input.csv", "r") as f_in, \
        open("output.csv", "w") as f_out:
    seen = set()  # a set to hold our 'visited' lines
    for line in f_in:  # iterate over the input file line by line
        line_hash = hashlib.md5(line.encode()).digest()  # hash the value
        if line_hash not in seen:  # we're seeing this line for the first time
            seen.add(line_hash)  # add it to the hash table
            f_out.write(line)  # write the line to the output

This uses MD5 as the hash, so each line costs about 16 bytes plus the set's overhead, which is still far less than holding everything in memory; you can expect roughly 500 MB of memory usage for a 16M-line CSV file.

zwer
  • Wouldn't it be more efficient to loop through the input lines only once? I/O operations are likely the most expensive. – norok2 Sep 19 '18 at 14:54
  • Some lines could be equivalent as far as CSV is concerned but hash to different values. For example CSV handles whitespace and quotes. – ditkin Sep 19 '18 at 14:57
  • 1
    @norok2 - True that, I've updated it with a single-pass approach. – zwer Sep 19 '18 at 15:00
  • @ditkin - Dealing with different formats can be solved with the same approach, one just needs a pre-process step to turn every CSV line into a reliably hashable string before hashing it. – zwer Sep 19 '18 at 15:04 (a sketch of such a pre-processing step follows these comments)
  • Also using a `list` for `seen` may be faster in some circumstances, especially as the size of `seen` grows... unfortunately `.add()` for set does not return if successful or not, and hence with `set()` you are actually checking if the hash is in `seen` twice, but since `in set()` is faster than `in list()` there is some room for `in list()` being slower for small `seen`. – norok2 Sep 19 '18 at 15:07
  • 1
    @norok2 - Sets are much, much faster for lookup (they are hash tables by themselves) than lists, especially as the size of `seen` grows. – zwer Sep 19 '18 at 15:09
  • 1
    @norok2 - If we are going to count ops, `in set()` is an `O(1)` operation (hash table lookup), `list.append()` and `set.add()` are also `O(1)`, with the caveat that the set one takes longer to process (due to hashing) and the fact that the underlying system might need more operations as both the list and the set grow (i.e. not counting amortized complexity). Meanwhile, `in list()` is an `O(n)` operation, which will get linearly slower as you add more elements to the list. Except for very, very small lists, sets will **always** outperform them in this sort of a task. – zwer Sep 19 '18 at 15:26
  • Yes, sorry, I was considering worst case scenario (for which `in set()` is actually `O(n)`), not average. – norok2 Sep 19 '18 at 15:36
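
Regarding ditkin's point about lines that are equivalent as CSV but differ byte-for-byte, the pre-processing step zwer mentions could look roughly like the sketch below: each line is parsed with the csv module and re-serialised into a canonical string before hashing. The exact normalisation chosen here (stripping surrounding whitespace and joining fields on a separator character) is an assumption, and reading line by line still won't cope with newlines embedded inside quoted fields.

import csv
import hashlib

def canonical_hash(line):
    # Parse the CSV line and hash a canonical re-serialisation of its fields, so
    # differences in quoting or surrounding whitespace don't change the hash.
    fields = next(csv.reader([line]))
    canonical = "\x1f".join(field.strip() for field in fields)
    return hashlib.md5(canonical.encode()).digest()

with open("input.csv", "r") as f_in, open("output.csv", "w") as f_out:
    seen = set()
    for line in f_in:
        line_hash = canonical_hash(line)
        if line_hash not in seen:
            seen.add(line_hash)
            f_out.write(line)  # keep the original text of the first occurrence
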
9

Essentially the same idea as zwer's, but checking rows that share a hash for actual equality instead of automatically discarding repeated hashes (which would drop a distinct row in the rare event of a hash collision). The first pass records which hashes occur more than once; the second pass keeps full row text in memory only for rows with those hashes.

file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # Skip header
    next(f_in)
    # Find duplicated hashes
    hashes = set()
    hashes_dup = {}
    for row in f_in:
        h = hash(row)
        if h in hashes:
            hashes_dup[h] = set()
        else:
            hashes.add(h)
    del hashes
    # Rewind file
    f_in.seek(0)
    # Copy header
    f_out.write(next(f_in))
    # Copy non repeated lines
    for row in f_in:
        h = hash(row)
        if h in hashes_dup:
            dups = hashes_dup[h]
            if row in dups:
                continue
            dups.add(row)
        f_out.write(row)
jdehesa
0

What about a UNIX simulator?

uniq <filename> >outputfile.txt

(something like that)

Dominique
  • 1
    `uniq` only works if the duplicate lines follow each other (it skips every line that was the same as the previous line). So for most applications you would want to `sort` first and then `uniq`. But it usually won't work if you have strings with newlines in your csv file. Or if you have an index in the csv file (because an index is unique). And you will want to look out for the header so it doesn't get sorted. But if all of that is fine then `uniq` is very fast. And runs in O(n). – grofte Mar 04 '21 at 21:47