I have the following two dataframes:
df2
has 2166 unique records.
I am trying to return the data in the blue area so I am using the following technique:
To get the green data I simply perform an inner join:
df_common = df1.merge(df2, how='inner' on='SKU')
which yields 413 unique records
To get the blue area, I appended the green data to df2
:
df2_plus_green = df2.append(df_common, ignore_index = True)
which yields 2579 records (which makes sense (413 + 2166 = 2579)
Then I dropped the duplicates:
df_blue = df2_plus_green.drop_duplicates(keep=False, subset=['SKU'])
which yields 1666 records. I would have thought it would yield 1753 records (2166 - 413 = 1753)
I have tried the above technique on a much smaller set of data and it works as expected. I assume the extra 87 records that are being dropped are duplicates somehow but I have checked each dataframe at each step along the way and all dataframes are made up of completely unique records. Can anyone point me in the right direction? I am sure it is something obvious that I'm missing.