I want to create a dataframe containing the rows of df that do not appear in any of the other dataframes (dfA, dfB, dfC, dfD). Every row of dfA, dfB, dfC and dfD is also contained in df, i.e. df is a superset of them, so n(df) = n(dfA) + n(dfB) + n(dfC) + n(dfD) + n(unclassified).

First I concatenate all the dataframes. Then, after cleaning the combined dataframe, I drop the duplicates in place with the argument keep = False.

After this, the last statement, len(unclassified) - (len(df) - (len(dfA) + len(dfB) + len(dfC) + len(dfD))), should evaluate to 0, but it comes out to about 47k. I repeated the same analysis for other combinations of the dataframes, and some difference persists every time. For example, using all but dfC, the difference came out to about -8.9k.

I'd like to understand why this is happening. If anyone can shed light on where I'm making a mistake, or on why this difference is to be expected, I'd be grateful.

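import pandas as pd

# Stack the superset and its four subsets, then renumber the rows.
unclassified = pd.concat([df, dfA, dfB, dfC, dfD])
unclassified = unclassified.reset_index(drop=True)
# keep=False drops every copy of a duplicated row, not just the extras.
unclassified.drop_duplicates(inplace=True, keep=False)
# Expected to be 0 if each classified row of df appears in exactly one subset.
len(unclassified) - (len(df) - (len(dfA) + len(dfB) + len(dfC) + len(dfD)))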
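
For reference, here is a minimal sketch of the behaviour I expect, on made-up toy data (the column names `id` and `val` and the two-subset setup are assumptions for illustration, not my real data). When every subset row matches exactly one row of df and nothing is repeated, the difference is 0:

```python
import pandas as pd

# Hypothetical toy data: df is the superset, dfA and dfB are disjoint subsets of it.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": ["a", "b", "c", "d", "e"]})
dfA = df.iloc[[0, 1]]   # rows with id 1 and 2
dfB = df.iloc[[2]]      # row with id 3

# Same steps as above: concatenate, then drop every row that occurs more than once.
combined = pd.concat([df, dfA, dfB], ignore_index=True)
unclassified = combined.drop_duplicates(keep=False)

# Rows with id 4 and 5 are the only ones that appear exactly once.
difference = len(unclassified) - (len(df) - (len(dfA) + len(dfB)))
print(unclassified)
print(difference)
```
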

I have already checked similar questions, but none of the OPs faced an issue like mine. The SO questions I consulted before asking, which might seem similar to this one:

  1. Pandas/Python: How to concatenate two dataframes without duplicates?, and
  2. Concatenate two dataframes and drop duplicates in Pandas
iamakhilverma
  • Have you tried creating a very small test dataset with a few rows per dataframe and inspecting the output to see what's going on? – Conor Sep 02 '21 at 10:20
  • No, I have not. The same command worked for others (in the other SO posts included in my post), and it's also shown in the pandas documentation with small examples. Should I still give it a try? – iamakhilverma Sep 02 '21 at 10:23
  • There are really two possibilities: either you are not invoking the commands correctly, or there's something about your large dataset that's making it not behave as you expect. A small dataset will allow you to rule out the former possibility, and then focus on narrowing the difference between your small dataset and the real one until you uncover the issue. – Conor Sep 02 '21 at 10:55
  • Okay, I'll see what I can do. Thank you – iamakhilverma Sep 02 '21 at 11:07
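
A small test along the lines suggested in the comments might look like the sketch below. It is not a diagnosis of the real dataset. The helper `count_gap` and all the toy frames are hypothetical names invented for illustration. It only shows three data conditions under which the same commands give a non-zero difference: a row repeated inside df itself, a row present in two subsets, and a subset row that does not match its df counterpart exactly (here, a trailing space).

```python
import pandas as pd

def count_gap(df, parts):
    """Hypothetical helper: repeat the question's check for df and its subsets."""
    combined = pd.concat([df, *parts], ignore_index=True)
    unclassified = combined.drop_duplicates(keep=False)
    return len(unclassified) - (len(df) - sum(len(p) for p in parts))

base = pd.DataFrame({"id": [1, 2, 3, 4], "val": ["a", "b", "c", "d"]})
dfA = base.iloc[[0]]   # id 1
dfB = base.iloc[[1]]   # id 2

# Clean case: subsets are disjoint and df has no repeated rows.
clean_gap = count_gap(base, [dfA, dfB])

# Case 1: an unclassified row (id 4) occurs twice inside df itself,
# so keep=False removes both copies and the gap goes negative.
df_dup = pd.concat([base, base.iloc[[3]]], ignore_index=True)
dup_gap = count_gap(df_dup, [dfA, dfB])

# Case 2: the same row (id 1) sits in two subsets, so it is subtracted twice
# from the expected count and the gap goes positive.
overlap_gap = count_gap(base, [dfA, base.iloc[[0, 1]]])

# Case 3: a subset row differs slightly from df ("a " vs "a"), so neither copy
# is treated as a duplicate and both survive.
dfA_bad = pd.DataFrame({"id": [1], "val": ["a "]})
mismatch_gap = count_gap(base, [dfA_bad, dfB])

print(clean_gap, dup_gap, overlap_gap, mismatch_gap)
```

If the toy cases behave this way, checks such as `df.duplicated().sum()` on the real df, or comparing `dtypes` across the five dataframes, would be a reasonable next step for narrowing things down.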
