
Suppose I have a large dataframe:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 1, 1, 1], 'B': [1, 1, 1, 1, 2, 3]})
df.to_csv("tmp.csv", sep="|", index=False)
df = pd.read_csv("tmp.csv", sep="|", chunksize=3)
```

How can I remove all duplicate rows, even across different chunks? That is, if the row 1, 1 appears in the first chunk, it must not appear again in any other chunk.
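
A minimal sketch of one way this could work while still reading in chunks, assuming the whole row is the duplicate key and that the set of distinct rows fits in memory:

```python
import pandas as pd

seen = set()            # rows already emitted from earlier chunks
unique_chunks = []

for chunk in pd.read_csv("tmp.csv", sep="|", chunksize=3):
    # Remove duplicates inside the chunk, then rows already seen in earlier chunks.
    chunk = chunk.drop_duplicates()
    rows = [tuple(r) for r in chunk.itertuples(index=False)]
    chunk = chunk[[r not in seen for r in rows]]
    seen.update(rows)
    unique_chunks.append(chunk)

result = pd.concat(unique_chunks, ignore_index=True)
print(result)
```

This still keeps one copy of every distinct row in memory, so it only helps when the data contains many duplicates.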

Márcio Mocellin
  • drop duplicates when saving to csv – RomanPerekhrest Mar 27 '23 at 16:12
  • If you're reading via `chunksize` due to memory limitations... pandas may simply not be the best tool for this job. – BeRT2me Mar 27 '23 at 16:15
  • Could you pre-clean the file to remove duplicate lines by calculating their hash (as suggested in [this other SO thread](https://stackoverflow.com/q/52407474/289011))? The idea would be to do a "cleaning" pass over your original csv (you don't even need pandas for this), save the result as a new file without duplicates, and then load the cleaned csv with pandas (see the sketch after these comments). – Savir Mar 27 '23 at 16:25
  • @RomanPerekhrest lol – Márcio Mocellin Mar 27 '23 at 17:35
  • @BeRT2me What would be a better tool? – Márcio Mocellin Mar 27 '23 at 17:35
  • You could try [polars](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html#polars-scan-csv), e.g. `df = pl.scan_csv(...).unique().collect()` – jqurious Mar 27 '23 at 18:08
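
Regarding the hash-based pre-cleaning pass suggested by Savir above, a rough sketch without pandas (`tmp_clean.csv` is just a placeholder name): stream the file line by line, hash each line, and write it out only the first time its hash is seen.

```python
import hashlib

seen_hashes = set()

with open("tmp.csv") as src, open("tmp_clean.csv", "w") as dst:
    dst.write(src.readline())  # keep the header line
    # Copy each data line only the first time its hash appears.
    for line in src:
        digest = hashlib.md5(line.encode("utf-8")).digest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            dst.write(line)
```

The cleaned file can then be read with pandas as usual, with or without `chunksize`.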

0 Answers