
I have a CSV file that is too big to load into memory. I need to drop the duplicated rows of the file, so I followed this approach:

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

for chunk in chunker:
    chunk.drop_duplicates(['Author ID'])

But if duplicated rows are distributed across different chunks, it seems the script above can't get the expected results.

Is there any better way?

You Gakukou

1 Answer


You could try something like this.

First, create your chunker.

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'], encoding='utf-8', chunksize=10000000)

Now create a set of ids:

ids = set()

Now iterate over the chunks:

for chunk in chunker:
    chunk = chunk.drop_duplicates(subset=['Author ID'])

Still within the body of the loop, also drop the ids that are already in the set of known ids:

    chunk = chunk[~chunk['Author ID'].isin(ids)]

Finally, still within the body of the loop, add the new ids to the set:

    ids.update(chunk['Author ID'].values)
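Putting the pieces together, the whole loop could look like the sketch below. The output file, the to_csv append step, and the tab separator are assumptions on my part, since the question doesn't say where the deduplicated rows should go:

import pandas as pd

OUTPUT_PATH = 'authors_dedup.txt'  # placeholder output path

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

ids = set()
with open(OUTPUT_PATH, 'w', encoding='utf-8') as out:
    for chunk in chunker:
        # Drop duplicates within the chunk (note the assignment back).
        chunk = chunk.drop_duplicates(subset=['Author ID'])
        # Drop rows whose id already appeared in an earlier chunk.
        chunk = chunk[~chunk['Author ID'].isin(ids)]
        # Remember the ids kept from this chunk.
        ids.update(chunk['Author ID'].values)
        # Append the surviving rows (tab-separated, like read_table's default).
        chunk.to_csv(out, sep='\t', header=False, index=False)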

If ids is too large to fit into main memory, you might need to use some disk-based database.
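For instance, here is a minimal sketch using sqlite3 from the standard library to keep the seen ids on disk; the database path, table name, and the row-by-row membership checks are my own choices, and shelve would work along similar lines:

import sqlite3
import pandas as pd

conn = sqlite3.connect('seen_ids.sqlite')  # placeholder database path
conn.execute('CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)')

chunker = pd.read_table(AUTHORS_PATH, names=['Author ID', 'Author name'],
                        encoding='utf-8', chunksize=10000000)

with open('authors_dedup.txt', 'w', encoding='utf-8') as out:  # placeholder output path
    for chunk in chunker:
        chunk = chunk.drop_duplicates(subset=['Author ID'])
        # Find ids from this chunk that are already recorded in the database.
        existing = set()
        for author_id in chunk['Author ID']:
            if conn.execute('SELECT 1 FROM seen WHERE id = ?', (str(author_id),)).fetchone():
                existing.add(author_id)
        chunk = chunk[~chunk['Author ID'].isin(existing)]
        # Record the newly kept ids so later chunks can see them.
        conn.executemany('INSERT OR IGNORE INTO seen (id) VALUES (?)',
                         [(str(i),) for i in chunk['Author ID']])
        conn.commit()
        chunk.to_csv(out, sep='\t', header=False, index=False)

conn.close()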

Ami Tavory
  • Thanks! I tried it, but the memory is still not enough. – You Gakukou Sep 08 '16 at 10:43
  • @yangxg Are you sure the set is the entity taking up your memory? What is its maximum `len`? If that is the problem, we need to escalate gradually. The next thing I would try is [`shelve`](https://docs.python.org/3/library/shelve.html). – Ami Tavory Sep 08 '16 at 11:01
  • Would adding something like this to the loop potentially help? `chunk['Author ID'] = pd.to_numeric(chunk['Author ID'], downcast='integer')` – Yale Newman Sep 20 '17 at 05:37