
Suppose I have a large dataframe:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 1, 1, 1], 'B': [1, 1, 1, 1, 2, 3]})
df.to_csv("tmp.csv", sep="|", index=False)
df = pd.read_csv("tmp.csv", sep="|", chunksize=3)
```

How can I remove all duplicate rows, even across different chunks? That is, if the row 1, 1 appears in the first chunk, it must not appear again in any other chunk.
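
A minimal sketch of one way this could work while still reading in chunks, assuming the whole row is the duplicate key and that the set of distinct rows fits in memory:

```python
import pandas as pd

seen = set()            # rows already emitted from earlier chunks
unique_chunks = []

for chunk in pd.read_csv("tmp.csv", sep="|", chunksize=3):
    # Remove duplicates inside the chunk, then rows already seen in earlier chunks.
    chunk = chunk.drop_duplicates()
    rows = [tuple(r) for r in chunk.itertuples(index=False)]
    chunk = chunk[[r not in seen for r in rows]]
    seen.update(rows)
    unique_chunks.append(chunk)

result = pd.concat(unique_chunks, ignore_index=True)
print(result)
```

This still keeps one copy of every distinct row in memory, so it only helps when the data contains many duplicates.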

Márcio Mocellin
  • drop duplicates when saving to csv – RomanPerekhrest Mar 27 '23 at 16:12
  • If you're reading via `chunksize` due to memory limitations... pandas may simply not be the best tool for this job. – BeRT2me Mar 27 '23 at 16:15
  • Could you pre-clean the file to remove duplicate lines by calculating their hash (as suggested in [this other SO thread](https://stackoverflow.com/q/52407474/289011))? The idea would be to do a "cleaning" pass over your original csv (you don't even need pandas for this), save the result as a new file without duplicates, and then load the cleaned csv with pandas (see the sketch after these comments). – Savir Mar 27 '23 at 16:25
  • @RomanPerekhrest lol – Márcio Mocellin Mar 27 '23 at 17:35
  • @BeRT2me What would be a better tool? – Márcio Mocellin Mar 27 '23 at 17:35
  • You could try [polars](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html#polars-scan-csv), e.g. `df = pl.scan_csv(...).unique().collect()` – jqurious Mar 27 '23 at 18:08
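
Regarding the hash-based pre-cleaning pass suggested by Savir above, a rough sketch without pandas (`tmp_clean.csv` is just a placeholder name): stream the file line by line, hash each line, and write it out only the first time its hash is seen.

```python
import hashlib

seen_hashes = set()

with open("tmp.csv") as src, open("tmp_clean.csv", "w") as dst:
    dst.write(src.readline())  # keep the header line
    # Copy each data line only the first time its hash appears.
    for line in src:
        digest = hashlib.md5(line.encode("utf-8")).digest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            dst.write(line)
```

The cleaned file can then be read with pandas as usual, with or without `chunksize`.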

0 Answers