2

I have two dataframes like this

import pandas as pd

df1 = pd.DataFrame(
    {
        'A': list('abcaewar'),
        'B': list('ghjglmgb'),
        'C': list('lkjlytle'),
        'ignore': ['stuff'] * 8
    }
)

df2 = pd.DataFrame(
    {
        'A': list('abfu'),
        'B': list('ghio'),
        'C': list('lkqw'),
        'stuff': ['ignore'] * 4
    }
)

and I would like to remove all rows in df1 where A, B and C are identical to values in df2, so in the above case the expected outcome is

   A  B  C ignore
0  c  j  j  stuff
1  e  l  y  stuff
2  w  m  t  stuff
3  r  b  e  stuff

One way of achieving this would be

comp_columns = ['A', 'B', 'C']
df1 = df1.set_index(comp_columns)
df2 = df2.set_index(comp_columns)

keep_ind = [
    ind for ind in df1.index if ind not in df2.index
]

new_df1 = df1.loc[keep_ind].reset_index()

Does anyone see a more straightforward way of doing this which avoids the reset_index() operations and the loop to identify non-overlapping indices, e.g. by a mart way of masking? Ideally, I don't have to hardcode the columns, but can define them in a list as above as I sometimes need 2, sometimes 3 or sometimes 4 or more columns for the removal.

Cleb
  • 25,102
  • 20
  • 116
  • 151

1 Answers1

1

Use DataFrame.merge with optional parameter indicator=True, then use boolean masking to filter the rows in df1:

df3 = df1.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]

Result:

# print(df3)

   A  B  C ignore
2  c  j  j  stuff
4  e  l  y  stuff
5  w  m  t  stuff
7  r  b  e  stuff
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 1
    Quite elegant, thanks (upvoted); did not know about `indicator`, that is quite a handy one... – Cleb Jul 09 '20 at 10:34