I want to create a dataframe containing the entries from df that don't exist in any of the other dataframes (dfA, dfB, dfC, dfD). Basically, the entries of dfA, dfB, dfC and dfD are all also contained in df, i.e. df is their superset, so n(df) = n(dfA) + n(dfB) + n(dfC) + n(dfD) + n(unclassified).
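For instance, with made-up numbers just to illustrate the relationship I expect: if len(df) = 100 and the four subsets account for 30 + 25 + 20 + 15 = 90 of those rows, then the unclassified dataframe should end up with exactly 100 - 90 = 10 rows.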
First I concatenate all the dataframes, and then, after cleaning the combined dataframe, I drop the duplicates in place with keep=False (see the code below).
After this, the output of the last statement, len(unclassified) - (len(df) - (len(dfA) + len(dfB) + len(dfC) + len(dfD))), should be 0, but it comes out to roughly 47k. I repeated the same analysis for other combinations as well, and some difference persists every time; e.g. using all subsets except dfC, the difference came out to about -8.9k.
I'm puzzled as to why this is happening. If anyone can shed some light on where I'm making a mistake, or on why this difference is to be expected, I'd be grateful.
unclassified = pd.concat([df, dfA, dfB, dfC, dfD])
unclassified = unclassified.reset_index()
unclassified = unclassified.drop(unclassified.columns[[0]], axis=1)  # drop the 'index' column added by reset_index
unclassified.drop_duplicates(inplace=True, keep=False)  # keep=False removes every occurrence of a duplicated row
len(unclassified) - (len(df) - (len(dfA) + len(dfB) + len(dfC) + len(dfD)))
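For reference, here is a minimal toy version of the same pipeline with made-up data (disjoint one-row subsets, no duplicate rows anywhere, and ignore_index=True as a shortcut for the reset_index step). On data like this the check does come out 0, which is exactly why the difference on my real data puzzles me:

import pandas as pd

# Made-up toy data: toy_df is the superset, toy_A..toy_D are disjoint one-row
# subsets of it, and exactly one row of toy_df belongs to none of them.
toy_df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": list("abcde")})
toy_A = toy_df.iloc[[0]]
toy_B = toy_df.iloc[[1]]
toy_C = toy_df.iloc[[2]]
toy_D = toy_df.iloc[[3]]

combined = pd.concat([toy_df, toy_A, toy_B, toy_C, toy_D], ignore_index=True)
combined.drop_duplicates(inplace=True, keep=False)  # every duplicated row is removed entirely

print(len(combined))  # 1 -> the single row that is in none of the subsets
print(len(combined) - (len(toy_df) - (len(toy_A) + len(toy_B) + len(toy_C) + len(toy_D))))  # 0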
I have already checked similar questions, but none of the OPs faced an issue like mine. SO questions that I consulted before asking and that might seem similar to this one -