1

I have two pandas dataframes, one that is a list of states, cities, and a capital flag with a multiIndex of (state, city), and another that is a non-indexed (or default indexed, if that's more appropriate) list of states and their capitals, I need to perform an inner join on the two and then also find out which items in the cities df are NOT in the join.

Cities:

                  capital
state    city
Ohio     Akron          N
         Toledo         N
         Columbus       N
Colorado Boulder        N
         Denver         N

States:

           state        city
0  West Virginia  Charleston
1           Ohio    Columbus

Inner join to find the capital of Ohio:

pd.merge(cities, states, on=['state', 'city'], how='inner')
  state      city capital
0  Ohio  Columbus       N

Now I need to get a df that includes everything in the cities df EXCEPT Columbus, Ohio. I've been looking at variations of .isin(), both with and without reset_index(), but I can't get it work.

Code to create the cities and states dfs. I have set_index() as a separate call because if I try to do it when I create the df I get an error about ValueError: Shape of passed values is (3, 3), indices imply (2, 3) and haven't figured out a way around it.

cities = pd.DataFrame({'state':['Ohio', 'Ohio', 'Ohio', 'Colorado', 'Colorado'], 'city':['Akron', 'Toledo', 'Columbus', 'Boulder', 'Denver'], 'capital':['N', 'N', 'N', 'N', 'N']}, columns=['state', 'city', 'capital'])
cities.set_index(('state', 'city'))
states = pd.DataFrame({'state':['West Virginia', 'Ohio'], 'city':['Charleston', 'Columbus']})
jml
  • 137
  • 1
  • 10
  • Possible duplicate of [how to identify whats NOT in the inner join during dataframe merge](https://stackoverflow.com/questions/40518639/how-to-identify-whats-not-in-the-inner-join-during-dataframe-merge) – Ben.T Oct 28 '19 at 00:49
  • or [this one](https://stackoverflow.com/questions/38242368/anti-merge-in-pandas-python) – Ben.T Oct 28 '19 at 00:53

1 Answers1

4

IIUC, you could use merge with how='outer' and indicator='source', and the keep only those that are 'left_only':

merge = cities.merge(states, on=['state', 'city'], how='outer', indicator='source')
result = merge[merge.source.eq('left_only')].drop('source', axis=1)
print(result)

Output

      state     city capital
0      Ohio    Akron       N
1      Ohio   Toledo       N
3  Colorado  Boulder       N
4  Colorado   Denver       N

As an alternative you could use isin, in the following way:

mask = ~cities.reset_index().city.isin(states.city)
print(cities[pd.Series(data=mask.values, index=cities.index)])

Output

                 capital
state    city           
Ohio     Akron         N
         Toledo        N
Colorado Boulder       N
         Denver        N

The idea of the second approach is to create a boolean mask with an index matching the one in cities. A variation on the second approach is the following:

# drop the index
re_indexed = cities.reset_index()

# find the mask
mask = ~re_indexed.city.isin(states.city)

# reindex back
result = re_indexed[mask].set_index(['state', 'city'])

print(result)
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • 1
    I got flagged as a duplicate; the suggested solutions were the same as your first one using `indicator`, but I really like the use of `isin()` in the second method, which I didn't see in the others. – jml Oct 28 '19 at 01:15
  • @jml Glad I could help! – Dani Mesejo Oct 28 '19 at 01:16