Computing Set Difference in Pandas between two dataframes

Question

Wondering how to compute set difference in Python's Pandas using two different dataframes.

One dataframe (df1) is of the format:

State  City          Population
NY     Albany        856654
WV     Wheeling      23434
SC     Charleston    35323
OH     Columbus      343534
WV     Charleston    34523

And the second data frame (df2) is

State  City
WV     Wheeling
OH     Columns

And I need an operation that returns the following data frame

State   City        Population
NY      Albany      856654
SC      Charleston  35323
WV      Charleston  34523

Essentially, I can't figure out how to "subtract" df2 from df1 based on the 2 columns (need both since I'll have repeats of city names across different states).

Are you looking for the set difference? http://stackoverflow.com/questions/18180763/set-difference-for-pandas — Tristan, Feb 23 '17 at 20:37
Had trouble making this solution work on my data, since I was dealing with two columns and my data frames didn't have the same structure! — crackernutter, Feb 23 '17 at 20:46

score 13 · Accepted Answer · answered Feb 23 '17 at 20:37

13

Do a left join with indicator which gives information on the origin of each row, then you can filter based on the indicator:

df1.merge(df2, indicator=True, how="left")[lambda x: x._merge=='left_only'].drop('_merge',1)

#State       City   Population
#0  NY      Albany      856654
#2  SC  Charleston       35323
#4  WV  Charleston       34523

answered Feb 23 '17 at 20:37

Psidom

209,562
33
339
356

Thank you! This worked like a charm, although as a relative Pandas newbie, it seems pretty complicated for seems like a standard operation, something done easily in sql (or even brute force). I'll take this, but 'm surprised there isn't a more elegant solution – crackernutter Feb 23 '17 at 21:43
Right. It might be helpful to have *anti-join* implemented in pandas, but it doesn't seem like so. And the work around surely is not elegant, but still acceptable which might be the reason why it's not implemented :). – Psidom Feb 23 '17 at 21:47

score 2 · Answer 2 · answered Feb 23 '17 at 22:33

2

How about filters?

df1[~((df1.City.isin(df2.City)) & (df1.State.isin(df2.State)))]

answered Feb 23 '17 at 22:33

Marcel Flygare

837
10
19

Computing Set Difference in Pandas between two dataframes

2 Answers2

Linked