I have a CSV file that is too big to load into memory, and I need to drop the duplicated rows from it. So I tried this:
import pandas as pd

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)
for chunk in chunker:
    # drop rows with a duplicated Author ID within this chunk
    chunk = chunk.drop_duplicates(['Author ID'])
But if duplicated rows are spread across different chunks, the script above can't get the expected result, because each chunk is deduplicated on its own.
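One idea I can think of is to keep a set of Author IDs that have already been seen and filter each chunk against it before writing it out. A rough sketch of that idea (it assumes the Author IDs alone fit in memory; OUTPUT_PATH is a hypothetical path for the deduplicated file):

import pandas as pd

seen_ids = set()  # assumes the Author IDs alone fit in memory
chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

with open(OUTPUT_PATH, 'w', encoding='utf-8') as out:  # OUTPUT_PATH is hypothetical
    for i, chunk in enumerate(chunker):
        # drop duplicates within the chunk, then drop rows whose ID appeared in an earlier chunk
        chunk = chunk.drop_duplicates(['Author ID'])
        chunk = chunk[~chunk['Author ID'].isin(seen_ids)]
        seen_ids.update(chunk['Author ID'])
        # write only the first chunk with a header
        chunk.to_csv(out, header=(i == 0), index=False)

This still needs enough memory to hold every distinct Author ID, so I'm not sure it scales.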
Is there any better way?