4

I have two dataframes, dfA and dfB of identical columns. I'm looking to get only the records from dataframe dfB that are not present in dataframe dfA.

to be clear, I do not want to get the records in dfA that are not in dfB.

I managed to hack something together that works using this, but the code is not easy to understand and by extension not very pythonic.

I'm looking for a more elegant solution, perhaps using pandas join/merge/append but wasn't able to make it work.

Example of what I want:

  dfA: 
  Date       Category  Price 
1 2013-11-24 Coat      22.1
2 2013-11-24 Shirt     8.7
3 2013-11-01 Socks     9     <<< Only present in this df

  dfB:
  Date       Category  Price 
1 2013-11-24 Coat      22.1
2 2013-11-24 Shirt     8.7
3 2013-11-24 Helmet    2.1   <<< Only present in this df
4 2013-11-24 Pants     10.7  <<< Only present in this df

  Result:
  Date       Category  Price 
1 2013-11-24 Helmet    2.1
2 2013-11-24 Pants     10.7
Idodo
  • 1,322
  • 11
  • 18

3 Answers3

6

One of the idiomatic approaches is to use merge(..., how='outer', indicator=True) and to filter resulting DF by generated column _merge:

In [18]: (A.merge(B, how='outer', indicator=True)
           .query("_merge == 'right_only'")
           .drop('_merge',1))
Out[18]:
         Date Category  Price
3  2013-11-24   Helmet    2.1
4  2013-11-24    Pants   10.7
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
2

Use merge with outer join and filter by indicator column with boolean indexing:

df = pd.merge(dfA, dfB, indicator=True, how='outer')
print (df)
         Date Category  Price      _merge
0  2013-11-24     Coat   22.1        both
1  2013-11-24    Shirt    8.7        both
2  2013-11-01    Socks    9.0   left_only
3  2013-11-24   Helmet    2.1  right_only
4  2013-11-24    Pants   10.7  right_only

df1 = df[df['_merge'] == 'right_only'].drop('_merge', axis=1)
print (df1)
         Date Category  Price
3  2013-11-24   Helmet    2.1
4  2013-11-24    Pants   10.7
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

In case you have a unique column in your table you are filtering by. Perhaps by Category in your case.

You could use :

dfB[~dfB['Category'].isin(dfA['Category'])]
Ajit
  • 667
  • 2
  • 14
  • 27