21

enter image description here

enter image description here

enter image description here

I want to find the values of col1 and col2 where the col1 and col2 of the first dataframe are both in the second dataframe.

These rows should be in the result dataframe:

  1. pizza, boy

  2. pizza, girl

  3. ice cream, boy

because all three rows are in the first and second dataframes.

How do I possibly accomplish this? I was thinking of using isin, but I am not sure how to use it when I have to consider more than one column.

S.G. Harmonia
  • 297
  • 2
  • 18
JungleDiff
  • 3,221
  • 10
  • 33
  • 57

5 Answers5

26

Perform an inner merge on col1 and col2:

import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))

print(pd.merge(df2.reset_index(), df1, how='inner').set_index('index'))

yields

            col1  col2
index                 
10         pizza   boy
11         pizza  girl
16     ice cream   boy

The purpose of the reset_index and set_index calls are to preserve df2's index as in the desired result you posted. If the index is not important, then

pd.merge(df2, df1, how='inner')
#         col1  col2
# 0      pizza   boy
# 1      pizza  girl
# 2  ice cream   boy

would suffice.


Alternatively, you could construct MultiIndexs out of the col1 and col2 columns, and then call the MultiIndex.isin method:

index1 = pd.MultiIndex.from_arrays([df1[col] for col in ['col1', 'col2']])
index2 = pd.MultiIndex.from_arrays([df2[col] for col in ['col1', 'col2']])
print(df2.loc[index2.isin(index1)])

yields

         col1  col2
10      pizza   boy
11      pizza  girl
16  ice cream   boy
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Instead of MultiIndex, you could opt to use df.loc[df.isin(filter_to_apply).sum(axis=1) == len(filter_to_apply.keys()), :] Here, filter to apply is a dictionary with column names as key, and dict values a list of values This takes the row-wise sum of the binary result of df.isin(filter_to_apply), and ensures that we filter rows for which all elements in the row are True. can easily change this filter for other types of filters – flow_me_over Mar 04 '22 at 10:05
12

Thank you unutbu! Here is a little update.

import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))
df1[df1.set_index(['col1','col2']).index.isin(df2.set_index(['col1','col2']).index)]

return:

    col1    col2
1   pizza   boy
4   pizza   girl
5   ice cream   boy
Ningrong Ye
  • 1,117
  • 11
  • 10
3

If somehow you must stick to isin or the negate version ~isin. You may first create a new column, with the concatenation of col1, col2. Then use isin to filter your data. Here is the code:

import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))

df1['indicator'] = df1['col1'].str.cat(df1['col2'])
df2['indicator'] = df2['col1'].str.cat(df2['col2'])

df2.loc[df2['indicator'].isin(df1['indicator'])].drop(columns=['indicator'])

which gives


    col1    col2
10  pizza   boy
11  pizza   girl
16  ice cream   boy

If you do so remember to make sure that concatenating two columns doesn't create false positives e.g. concatenation of 123 and 456 in df1 and concatenation of 12 and 3456 in df2 will match even though their respective columns don't match. You can fix this problem by additional sep parameter.

df1['indicator'] = df1['col1'].str.cat(df1['col2'], sep='$$$')
df2['indicator'] = df2['col1'].str.cat(df2['col2'], sep='$$$')
YuseqYaseq
  • 244
  • 5
  • 16
Azuuu
  • 853
  • 8
  • 17
0

One possible way is to define a check function of your own and perform apply on the dataframe.

For example, if you know the list of combinations that need to be filtered (this list can be extracted beforehand from a dataframe):

filter_list_multicols = [["book", "cat"], ["table", "dog"], ["table", "cat"], ["pen", "horse"], ["book", "horse"]]

Then you could define a check function as so:

def isin_multicols_check(stationary_name, animal_name):
    for filter_pair in filter_list_multicols:
        if (stationary_name == filter_pair[0]) and (animal_name == filter_pair[1]):
                return True

    return False

Example dataframe:

df = pd.DataFrame([
                   [1, "book", "dog"], [2, "pen", "dog"], [3, "pen", "rat"], [4, "book", "horse"], [5, "book", "cat"]
                  ],
                   columns=["S.N.", "stationary_name", "animal_name"])
df
S.N.    stationary_name  animal_name
1           book            dog
2           pen             dog
3           pen             rat
4           book            horse
5           book            cat

And now, call the function using pandas apply:

df["is_in"] = df.apply(lambda x: isin_multicols_check(x.stationary_name, x.animal_name), axis=1)
df
S.N.    stationary_name  animal_name    is_in
1           book            dog         false
2           pen             dog         false
3           pen             rat         false
4           book            horse       true
5           book            cat         true

The result:

is_in = df[df["is_in"]==True]
not_is_in = df[df["is_in"]==False]
Bikash Gyawali
  • 969
  • 2
  • 15
  • 33
-2

The best way is to pass a dict to isin()

As https://www.oreilly.com/library/view/mastering-exploratory-analysis/9781789619638/eb563c9a-83e1-4e0c-82d7-6f83addc3340.xhtml suggests.

Also the documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html shows another example of how to pass a dictionary.

George C
  • 1,168
  • 13
  • 30
  • 1
    The link to the O'Reilly answer is 404, which is why Stack Exchange prefers people give answers instead of merely quoting them. – Michael Tiemann Jan 07 '23 at 18:25