I want to create a dataframe containing the entries from df that don't exist in any of the other dataframes (dfA, dfB, dfC, dfD). Basically, the entries of dfA, dfB, dfC and dfD are all also contained in df, i.e. df is their superset, so n(df) = n(dfA) + n(dfB) + n(dfC) + n(dfD) + n(unclassified).
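For instance, with made-up numbers just to illustrate the relationship I expect: if len(df) = 100 and the four subsets account for 30 + 25 + 20 + 15 = 90 of those rows, then the unclassified dataframe should end up with exactly 100 - 90 = 10 rows.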
First I concatenate all the dataframes, and then, after cleaning the combined dataframe, I drop the duplicates in place with keep=False (see the code below).
After this, the output of the last statement, len(unclassified) - (len(df) - (len(dfA) + len(dfB) + len(dfC) + len(dfD))), should be 0, but it comes out to roughly 47k. I repeated the same analysis for other combinations as well, and some difference persists every time; e.g. using all subsets except dfC, the difference came out to about -8.9k.
I'm puzzled as to why this is happening. If anyone can shed some light on where I'm making a mistake, or on why this difference is to be expected, I'd be grateful.
unclassified = pd.concat([df, dfA, dfB, dfC, dfD])
unclassified = unclassified.reset_index()
unclassified = unclassified.drop(unclassified.columns[[0]], axis=1)  # drop the 'index' column added by reset_index
unclassified.drop_duplicates(inplace=True, keep=False)  # keep=False removes every occurrence of a duplicated row
len(unclassified) - (len(df) - (len(dfA) + len(dfB) + len(dfC) + len(dfD)))
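For reference, here is a minimal toy version of the same pipeline with made-up data (disjoint one-row subsets, no duplicate rows anywhere, and ignore_index=True as a shortcut for the reset_index step). On data like this the check does come out 0, which is exactly why the difference on my real data puzzles me:

import pandas as pd

# Made-up toy data: toy_df is the superset, toy_A..toy_D are disjoint one-row
# subsets of it, and exactly one row of toy_df belongs to none of them.
toy_df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "val": list("abcde")})
toy_A = toy_df.iloc[[0]]
toy_B = toy_df.iloc[[1]]
toy_C = toy_df.iloc[[2]]
toy_D = toy_df.iloc[[3]]

combined = pd.concat([toy_df, toy_A, toy_B, toy_C, toy_D], ignore_index=True)
combined.drop_duplicates(inplace=True, keep=False)  # every duplicated row is removed entirely

print(len(combined))  # 1 -> the single row that is in none of the subsets
print(len(combined) - (len(toy_df) - (len(toy_A) + len(toy_B) + len(toy_C) + len(toy_D))))  # 0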
I have already checked similar questions, but none of the OPs faced an issue like mine. SO questions that I consulted before asking and that might seem similar to this one -