Check if rows of one DataFrame are present in another

Question

I have two dataframes:

DF1

 A   B
'a' 'x' 
'b' 'y'
'c' 'z'

DF2

Col1 Col2
'j'  'm'
'a'  'x'
'k'  'n'
'b'  'y'

And want to look up if the rows of DF1 are contained in DF2, and add that column Bool_col to DF1, like this.

DF1

 A   B   Bool_col
'a' 'x'  True
'b' 'y'  True
'c' 'z'  False

I've tried by looking up the concatenation of A and B in the concatenation-list of Col1 and Col2, but my data is giving me unexpected trouble. Any help on how to do this without concatenating columns?

`merge` with `indicator` – user3483203 May 15 '19 at 04:53 — user3483203, May 15 '19 at 04:53

score 3 · Answer 1 · answered May 15 '19 at 05:11

Use merge with the indicator argument, then check what rows show "both".

df1['Bool_col'] = (df1.merge(df2, 
                             how='left', 
                             left_on=['A', 'B'], 
                             right_on=['Col1', 'Col2'], 
                             indicator=True)
                      .eval('_merge == "both"'))

df1
     A    B  Bool_col
0  'a'  'x'      True
1  'b'  'y'      True
2  'c'  'z'     False

Sociopath · Accepted Answer · 2019-05-15T05:39:47.773

3

Use pandas.merge and numpy.where

df = df1.merge(df2, how='left', indicator=True, left_on=['A','B'], right_on=['Col1','Col2'])
df['Bool_col'] = np.where(df['_merge']=='both', True, False)
df.drop(['_merge','Col1','Col2'], 1, inplace=True)
print(df)

Output:

   A  B     Bool_col
0  a  x      True
1  b  y      True
2  c  z     False

Edit

As per @cs95 suggested in comments, np.where is unnecessary here. You can simply do

df1['Bool_col'] = df['_merge']=='both'
# df.drop(['_merge','Col1','Col2'], 1, inplace=True)

edited May 15 '19 at 05:39

answered May 15 '19 at 05:14

Sociopath

13,068
19
47
75

1

This isn't wrong, but it's more unwieldy than it needs to be. – cs95 May 15 '19 at 05:23
@cs95 I am still figuring out what `.eval` does in your answer. Can you please elaborate that? – Sociopath May 15 '19 at 05:25
3

`eval` is handy for condensing multiple lines into a single. Typically to perform a conditional on a column, you'd need to store the result as a variable and access the resultant column in a subsequent statement. With eval, you can [do this dynamically](https://stackoverflow.com/a/53779987/4909087) using string evaluation. – cs95 May 15 '19 at 05:28
PS: The unwieldy part I was referring to was `df['Bool_col'] = np.where(df['_merge']=='both', True, False)` which could simply be `df['Bool_col'] = df['_merge']=='both'`. You could also assign the result to `df1` instead of creating `df` and dropping columns. – cs95 May 15 '19 at 05:29
@cs95 Thanks for explanation and pointing out unnecessary part. Corrected. – Sociopath May 15 '19 at 05:35
1

Looks better now, but you should still/also consider assigning back to df1. In general, df2 could have more columns than just col1 and col2, so dropping may or may not be sufficient here. – cs95 May 15 '19 at 05:35
I have a question which might be important: does df.merge(how='left', ...) guarantee that the order of the elements in df will remain the same? – Marses Dec 12 '19 at 10:39

Check if rows of one DataFrame are present in another

2 Answers2