I would like to delete duplicates from one large CSV file. The data has this format:
client_id;gender;age;profese;addr_cntry;NAZOKRESU;prijem_AVG_6M_pasmo;cont_id;main_prod_id;bal_actl_am_pasmo
388713248;F;80;důchodce;CZ;Czech;;5715125;39775;
27953927;M;28;Dělník;CZ;Opavia;22;4427292;39075;
I need to delete all duplicates based on client_id. I cannot handle this big file in Python with pandas. I tried Dask, but got the same result: an endless wait and nothing really happens.
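To make the intended result concrete, here is a rough plain-Python sketch of what I mean: drop every row whose client_id appears more than once (matching keep=False in the code below). The output name dedup.csv is made up and I assume UTF-8 just for the sketch; I would still prefer a pandas/Dask solution.

import csv
from collections import Counter

# First pass: count how many times each client_id occurs.
# Encoding is assumed to be UTF-8 here; the real file's encoding is detected with chardet below.
counts = Counter()
with open('bigData.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter=';')
    for row in reader:
        counts[row['client_id']] += 1

# Second pass: write only rows whose client_id is unique.
with open('bigData.csv', newline='', encoding='utf-8') as f_in, \
     open('dedup.csv', 'w', newline='', encoding='utf-8') as f_out:
    reader = csv.DictReader(f_in, delimiter=';')
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter=';')
    writer.writeheader()
    for row in reader:
        if counts[row['client_id']] == 1:
            writer.writerow(row)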
Here is my latest version of the code:
import dask.dataframe as dd
import chardet
from dask.diagnostics import ProgressBar

# Detect the file encoding (this reads the whole file into memory).
with open('bigData.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = dd.read_csv('bigData.csv', encoding=result['encoding'], sep=';')

# Count rows before deduplication.
total_rows = df.shape[0].compute()

# Drop every row whose client_id appears more than once (keep=False drops all copies).
df = df.drop_duplicates(subset=['client_id'], keep=False)

df.to_csv('bigData.csv', sep=';', index=False)

total_duplicates = total_rows - df.shape[0].compute()
print(f'Deleted {total_duplicates} duplicate rows.')
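For completeness, this is how I understand the ProgressBar is supposed to be wrapped around the compute() calls (a minimal sketch using the same df as above):

from dask.diagnostics import ProgressBar

# ProgressBar is a context manager: any compute() inside the block reports its progress.
with ProgressBar():
    total_rows = df.shape[0].compute()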
I tried it with the progress bar and nothing really happened. Thanks for any help!