6

I have a df (df1) that looks like:

df1 = pd.DataFrame([
        ['YYZ', 'SFO', 1],
        ['YYZ', 'YYD', 1],
        ['YYZ', 'EWR', 1],
        ['YYZ', 'DFW', 1],
        ['YYZ', 'LAX', 1],
        ['YYZ', 'YYC', 1]
    ], columns=['city1', 'city2', 'val'])

I have another df (df2) that is a subset of df1:

df2 = pd.DataFrame([
        ['YYZ', 'SFO', 1],
        ['YYZ', 'YYD', 1]
    ], columns=['city1', 'city2', 'val'])

I want all rows in df1 that are NOT in df2.

I've tried various options described in this post conditional slicing based on values of 2 columns, however I haven't been able to get it to work.

Your help would be appreciated.

Community
  • 1
  • 1
codingknob
  • 11,108
  • 25
  • 89
  • 126
  • 1
    Possible duplicate of [pandas get rows which are NOT in other dataframe](http://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe) – A.Kot Jan 27 '17 at 19:45

3 Answers3

6
  • Use merge with indicator=True
  • Then use query to strip out only those with 'left_only'

df1.merge(
    df2, how='outer', indicator=True
).query('_merge == "left_only"').drop('_merge', 1)

  city1 city2  val
2   YYZ   EWR    1
3   YYZ   DFW    1
4   YYZ   LAX    1
5   YYZ   YYC    1
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • This solution produced the correct result. The df1[~df1.isin(df2).all(axis=1)] solution seemed more intuitive but did not generate the correct results for some reason. I will investigate why and post updates – codingknob Jan 29 '17 at 04:05
5

Just ask the question straight in plain English, hmm I mean in plain pandas. "Select all rows in df1 that are not in df2" translates to:

df1[~df1.isin(df2).all(axis=1)]
Out[127]: 
  city1 city2  val
2   YYZ   EWR    1
3   YYZ   DFW    1
4   YYZ   LAX    1
5   YYZ   YYC    1
Zeugma
  • 31,231
  • 9
  • 69
  • 81
  • This solution seemed the most obvious but the results were something I did not expect (not the correct result). I will look further into why and will post some updates. – codingknob Jan 29 '17 at 04:04
  • 1
    Would you care to explain what ~ does in your code df1[~df1.isin(df2)] please? Can't google anything out of it since it's just a symbol. Thanks. – Bowen Liu Oct 29 '18 at 16:04
  • ~ in this case is a negator. It means NOT whatever comes next. So every True becomes False and vice versa. – JDenman6 Nov 16 '20 at 16:17
  • 1
    I updated this from `df1[~df1.isin(df2).all(axis=1)]` to `df1[~df1.isin(df2).any(axis=1)]` and it works much more the way I would expect. – JDenman6 Nov 16 '20 at 16:23
0

the ~ symbol reverses the isin and makes it effectively a isnotin