
I have a CSV file that is too big to load into memory. I need to drop the duplicated rows of the file, so I followed this approach:

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])

But if duplicated rows are distributed across different chunks, it seems the script above can't get the expected results.

Is there any better way?

You Gakukou

1 Answer


You could try something like this.

First, create your chunker.

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

Now create a set of ids:

ids = set()

Now iterate over the chunks:

for chunk in chunker:
    chunk = chunk.drop_duplicates(subset=['Author ID'])

Still within the body of the loop, also drop the ids that are already in the set of known ids:

    chunk = chunk[~chunk['Author ID'].isin(ids)]

Finally, still within the body of the loop, add the new ids to the set:

    ids.update(chunk['Author ID'].values)
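Putting the pieces together, the whole loop could look like the sketch below. The output file, the to_csv append step, and the tab separator are assumptions on my part, since the question doesn't say where the deduplicated rows should go:

import pandas as pd

OUTPUT_PATH = 'authors_dedup.txt'  # placeholder output path

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

ids = set()
with open(OUTPUT_PATH, 'w', encoding='utf-8') as out:
    for chunk in chunker:
        # Drop duplicates within the chunk (note the assignment back).
        chunk = chunk.drop_duplicates(subset=['Author ID'])
        # Drop rows whose id already appeared in an earlier chunk.
        chunk = chunk[~chunk['Author ID'].isin(ids)]
        # Remember the ids kept from this chunk.
        ids.update(chunk['Author ID'].values)
        # Append the surviving rows (tab-separated, like read_table's default).
        chunk.to_csv(out, sep='\t', header=False, index=False)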

If ids is too large to fit into main memory, you might need to use some disk-based database.
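For instance, here is a minimal sketch using sqlite3 from the standard library to keep the seen ids on disk; the database path, table name, and the row-by-row membership checks are my own choices, and shelve would work along similar lines:

import sqlite3
import pandas as pd

conn = sqlite3.connect('seen_ids.sqlite')  # placeholder database path
conn.execute('CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)')

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

with open('authors_dedup.txt', 'w', encoding='utf-8') as out:  # placeholder output path
    for chunk in chunker:
        chunk = chunk.drop_duplicates(subset=['Author ID'])
        # Find ids from this chunk that are already recorded in the database.
        existing = set()
        for author_id in chunk['Author ID']:
            if conn.execute('SELECT 1 FROM seen WHERE id = ?', (str(author_id),)).fetchone():
                existing.add(author_id)
        chunk = chunk[~chunk['Author ID'].isin(existing)]
        # Record the newly kept ids so later chunks can see them.
        conn.executemany('INSERT OR IGNORE INTO seen (id) VALUES (?)',
                         [(str(i),) for i in chunk['Author ID']])
        conn.commit()
        chunk.to_csv(out, sep='\t', header=False, index=False)

conn.close()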

Ami Tavory
  • Thanks! I tried it, but the memory is still not enough. – You Gakukou Sep 08 '16 at 10:43
  • @yangxg Are you sure the set is the entity taking up your memory? What is its maximum `len`? If that is the problem, we need to escalate gradually. The next thing I would try is [`shelve`](https://docs.python.org/3/library/shelve.html). – Ami Tavory Sep 08 '16 at 11:01
  • Would adding something like this to the loop potentially help? `chunk['Author ID'] = pd.to_numeric(chunk['Author ID'], downcast='integer')` – Yale Newman Sep 20 '17 at 05:37