
I need to remove duplicates from a large .csv file (50+ GB), and I would like to do this using Python. Several other questions address the issue broadly (e.g., here and here), but they deal with exact duplicates.

In my case, the duplicates are not exact duplicates. I compiled this file by pulling rows from several sources, and one column indicates the source of origin, so I would like to remove duplicates based on a subset of columns. The size of the file means that I cannot load it into memory, so pandas is out.

How can I approach this problem (possibly modifying the solutions I linked to)?

amquack

2 Answers


For large dataframes, you can use dask. Its syntax is similar to pandas, so if you are already familiar with pandas you will have no problem. Example:

import dask.dataframe as dsk
df = dsk.read_csv('my_csv.csv')

The signature of drop_duplicates (documented here):

df.drop_duplicates(subset=None, split_every=None, split_out=1, ignore_index=False, **kwargs)

allows deduplication on a subset of columns via the subset argument.

Example:

from dask.distributed import Client
import dask.dataframe as dsk

client = Client(memory_limit='32GB')  # per-worker memory budget; set your limit here
df = dsk.read_csv("my_csv", sample=100)  # sample = bytes read up front to infer dtypes
results = df.drop_duplicates(subset=['my_col'], split_out=df.npartitions)
# Writing to a directory produces one CSV part file per partition and triggers
# the computation, so no explicit .compute() call is needed.
results.to_csv(outdir, index=False)
Mohammad
  • Can you expand on this to show how/when to call the .compute() method and how to save the output as a .csv? I'm having trouble seeing how to do this without pulling the df into memory, and my current attempt to avoid this is not working properly. – amquack Aug 16 '21 at 00:52
  • @amquack added an example – Mohammad Aug 16 '21 at 08:43

I recommend duckdb. Here is a quick intro and benchmark: https://duckdb.org/2021/05/14/sql-on-pandas.html
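A minimal sketch of what this could look like, assuming hypothetical names throughout ('my_csv.csv', 'deduped.csv', 'dedupe.db', and the key columns key_col1/key_col2 stand in for your subset): duckdb streams the CSV rather than loading it whole, and connecting to an on-disk database gives it somewhere to spill intermediate state if the deduplication does not fit in RAM.

import duckdb

# Connect to an on-disk database (hypothetical name) so large intermediate
# results can spill to disk instead of being held entirely in memory.
con = duckdb.connect('dedupe.db')

# DISTINCT ON keeps one row per combination of the key columns while
# preserving all other columns; COPY streams the result to a new CSV.
con.execute("""
    COPY (
        SELECT DISTINCT ON (key_col1, key_col2) *
        FROM read_csv_auto('my_csv.csv')
    ) TO 'deduped.csv' (HEADER, DELIMITER ',')
""")
con.close()

Which row survives each duplicate group is arbitrary unless you add an ORDER BY inside the subquery, which you could use, for example, to prefer rows from a particular source.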

Igor Rivin