
I have an Excel file with a minimum of 600,000 lines (the size varies). I want to get all duplicates of a particular column with Pandas.

This is what I have tried so far:

import pandas as pd

use_cols = ['ID', 'AMOUNT']

df = pd.DataFrame()

# Read the file in chunks and stitch them back into one DataFrame
for chunk in pd.read_csv("INPUT.csv", usecols=use_cols, chunksize=10000):
    df = pd.concat([df, chunk])

# Rows whose ID has already appeared earlier in the frame
duplicates = df[df.duplicated(["ID"])]

print(duplicates)

However, the results I get are not duplicates and I'm not sure what I might be doing wrong. Is there a more efficient way to go about this?

2 Answers


I have tried duplicated and I get the rows that are duplicated. That is to say, the first occurrence is not included, because it would be unique if the others weren't there.

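For illustration, here is a minimal sketch of that behavior (the small frame below is made-up data, not from the question). By default duplicated uses keep='first', so the first occurrence is not flagged; passing keep=False flags every occurrence of a repeated ID:

import pandas as pd

# Made-up example data for illustration only
df = pd.DataFrame({"ID": [1, 1, 2, 3, 3, 3],
                   "AMOUNT": [10, 20, 30, 40, 50, 60]})

# Default (keep='first'): the first occurrence is NOT marked as a duplicate
print(df[df.duplicated("ID")])              # rows 1, 4, 5

# keep=False: every row whose ID is repeated is marked
print(df[df.duplicated("ID", keep=False)])  # rows 0, 1, 3, 4, 5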


If you are working with some operation on chunked data that isn't behaving as expected, one of the first troubleshooting steps is to think through how that operation is performed on a single dataset, and then how you would aggregate its results when it is performed on multiple chunks/partitions. Often you'll find there is some adaptation you need that you haven't actually implemented yet.

The problem here is that you are reading the data in chunks, so a record that is repeated across your file but never appears more than once within a single chunk will not show up as a duplicate. There are a variety of ways of dealing with this, including an "external merge sort", where you sort the data first so that duplicates always end up next to each other. Another answer here takes a different approach that may be easier to implement; because it still works through the file in chunks, it shouldn't have trouble handling all of the data. One possible chunk-aware approach is sketched below.
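As a sketch of such a chunk-aware approach (assuming the same INPUT.csv and column names as in the question), you could make two passes over the file: first count how often each ID appears, then keep only the rows whose ID occurs more than once. This never requires all duplicate IDs to fall into the same chunk:

import pandas as pd
from collections import Counter

# First pass: count occurrences of each ID, chunk by chunk
id_counts = Counter()
for chunk in pd.read_csv("INPUT.csv", usecols=["ID"], chunksize=10000):
    id_counts.update(chunk["ID"])

# IDs that appear more than once anywhere in the file
dup_ids = {i for i, n in id_counts.items() if n > 1}

# Second pass: collect only the rows whose ID is duplicated
dup_rows = []
for chunk in pd.read_csv("INPUT.csv", usecols=["ID", "AMOUNT"], chunksize=10000):
    dup_rows.append(chunk[chunk["ID"].isin(dup_ids)])

duplicates = pd.concat(dup_rows)
print(duplicates)

Only the ID counts are held in memory between passes, which keeps the footprint small even for large files.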

  • Thanks. However, what you described is not my case, because I'm not finding duplicates per chunk; I concatenate all the chunks and then find duplicates. – Dave Nathaniel May 05 '20 at 16:37