
I need to remove duplicates from a large .csv file (50+ GB), and I would like to do this using Python. Several other questions address the issue broadly (e.g., here and here), but they deal with exact duplicates.

In my case, the duplicates are not exact duplicates. I compiled this file by pulling rows from several sources, and one column indicates the source of origin, so I would like to remove duplicates based on a subset of columns. The size of the file means that I cannot load it into memory, so pandas is out.

How can I approach this problem (possibly modifying the solutions I linked to)?

amquack

2 Answers


For large dataframes, you can use dask. Its syntax is similar to pandas, so if you are already familiar with pandas you will have no problem. Example:

import dask.dataframe as dsk
df = dsk.read_csv('my_csv.csv')

The signature of drop_duplicates (documented here):

df.drop_duplicates(subset=None, split_every=None, split_out=1, ignore_index=False, **kwargs)

allows deduplication on a subset of columns via the subset argument.

Example:

from dask.distributed import Client
import dask.dataframe as dsk

client = Client(memory_limit='32GB')  # per-worker memory budget; set your limit here
df = dsk.read_csv("my_csv", sample=100)  # sample = bytes read up front to infer dtypes
results = df.drop_duplicates(subset=['my_col'], split_out=df.npartitions)
# Writing to a directory produces one CSV part file per partition and triggers
# the computation, so no explicit .compute() call is needed.
results.to_csv(outdir, index=False)
Mohammad
  • Can you expand on this to show how/when to call the .compute() method and how to save the output as a .csv? I'm having trouble seeing how to do this without pulling the df into memory, and my current attempt to avoid this is not working properly. – amquack Aug 16 '21 at 00:52
  • @amquack added an example – Mohammad Aug 16 '21 at 08:43

I recommend duckdb. Here is a quick intro and benchmark: https://duckdb.org/2021/05/14/sql-on-pandas.html
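A minimal sketch of what this could look like, assuming hypothetical names throughout ('my_csv.csv', 'deduped.csv', 'dedupe.db', and the key columns key_col1/key_col2 stand in for your subset): duckdb streams the CSV rather than loading it whole, and connecting to an on-disk database gives it somewhere to spill intermediate state if the deduplication does not fit in RAM.

import duckdb

# Connect to an on-disk database (hypothetical name) so large intermediate
# results can spill to disk instead of being held entirely in memory.
con = duckdb.connect('dedupe.db')

# DISTINCT ON keeps one row per combination of the key columns while
# preserving all other columns; COPY streams the result to a new CSV.
con.execute("""
    COPY (
        SELECT DISTINCT ON (key_col1, key_col2) *
        FROM read_csv_auto('my_csv.csv')
    ) TO 'deduped.csv' (HEADER, DELIMITER ',')
""")
con.close()

Which row survives each duplicate group is arbitrary unless you add an ORDER BY inside the subquery, which you could use, for example, to prefer rows from a particular source.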

Igor Rivin