1

I have two data frames:

df1 = pd.DataFrame({
    "name": ["Peter", "John", "Jack", "Mark", "Adam", "Mike", "Aaron", "Mike"],
    "age": [25, 34, 58, 29, 42, 39, 48, 24],
})

df2 = pd.DataFrame({
    "name": ["Mark", "Jack", "Adam", "Mike"],
    "age": [29, 58, 42, 39],
    "is_funny": [False, True, True, False],
})

I want to remove all rows present in df2 from df1 matching in name and age. This would result in the following data frame:

>>> df3
    name  age
0  Peter   25
1   John   34
2  Aaron   48
3   Mike   24

df.isin() only seems to work on ordered rows and checks using index (not the case here).

Thanks in advance

BBQuercus
  • 819
  • 1
  • 11
  • 28

4 Answers4

3

One way is to use outer merge:

df2.merge(df1, on=['name','age'], how='outer', indicator=True)\
   .query('_merge == "right_only"').reindex(df1.columns, axis=1)

Output:

    name  age
4  Peter   25
5   John   34
6  Aaron   48
7   Mike   24
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    i was going to close as a dupe with [pandas merge 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) but maybe it's not a total dupe due to reindex – Umar.H May 11 '21 at 13:23
3

You can use isin on both columns, like:

df1[~df1.set_index(['name','age']).index.isin(df2.set_index(['name','age']).index)]

Out[177]: 
    name  age
0  Peter   25
1   John   34
6  Aaron   48
7   Mike   24
sophocles
  • 13,593
  • 3
  • 14
  • 33
2

Outer Merge will select the data as you described.


import pandas as pd

df1 = pd.DataFrame({
    "name": ["Peter", "John", "Jack", "Mark", "Adam", "Mike", "Aaron", "Mike"],
    "age": [25, 34, 58, 29, 42, 39, 48, 24],
})

df2 = pd.DataFrame({
    "name": ["Mark", "Jack", "Adam", "Mike"],
    "age": [29, 58, 42, 39],
    "is_funny": [False, True, True, False],
})

df3 = df2.merge(df1, on=['name','age'], how='outer', indicator=True).query('_merge == "right_only"').reindex(df1.columns, axis=1)

print(df3)
Joe Thor
  • 1,164
  • 1
  • 11
  • 19
1

One way:

df1[df1.merge(df2, on=['name','age'], how='left', indicator=True)['_merge']=='left_only']

#     name  age
# 0  Peter   25
# 1   John   34
# 6  Aaron   48
# 7   Mike   24

Or:

df3 = pd.concat([df1.assign(source=1),df2.assign(source=2)])
df3 = df3.drop_duplicates(df1.columns, keep=False)
df3 = df3[df3['source'] == 1][df1.columns]

#     name  age
# 0  Peter   25
# 1   John   34
# 6  Aaron   48
# 7   Mike   24
Andreas
  • 8,694
  • 3
  • 14
  • 38