1

I have two dataframes. If a person in df1 has the same name and birth date as a person in df2, I want to delete all rows with that name and birth date from df1. How can I do this using pandas?

df1=

Full name Birth date param1 param2 param10
Name 1 date1 something something something
Name 2 date2 something something something
Name 3 date3 something something something
Name 4 date4 something something something

df2=

Full name Birth date param11 param12 param20
Name 1 date1 something something something
Name 2 date2 something something something
wjandrea

5 Answers

2

Another possible solution:

d1 = df1.set_index(['Full name', 'Birth date'])
d2 = df2.set_index(['Full name', 'Birth date'])

d1.loc[d1.index.difference(d2.index)].reset_index()

Output:

  Full name Birth date     param1     param2    param10
0    Name 3      date3  something  something  something
1    Name 4      date4  something  something  something
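
As a sketch of a related variant: `Index.difference` returns its keys sorted, so the rows come back in key order rather than in df1's original order. Filtering with `Index.isin` keeps df1's row order. The sample frames below are made up to mirror the question (with the rows deliberately shuffled), and `keys` and `result` are illustrative names.

```python
import pandas as pd

# Made-up sample data shaped like the question's frames, rows shuffled on purpose.
df1 = pd.DataFrame({
    'Full name': ['Name 4', 'Name 1', 'Name 3', 'Name 2'],
    'Birth date': ['date4', 'date1', 'date3', 'date2'],
    'param1': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 2'],
    'Birth date': ['date1', 'date2'],
    'param11': ['x', 'y'],
})

keys = ['Full name', 'Birth date']
d1 = df1.set_index(keys)
d2 = df2.set_index(keys)

# Keep the rows of df1 whose (name, birth date) pair does not occur in df2.
result = d1[~d1.index.isin(d2.index)].reset_index()
print(result)
```

Here the surviving rows stay in the order they had in df1 (Name 4 before Name 3).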
PaulS
1

Here is one way:

merged = df1.merge(df2, on=['Full name', 'Birth date'], how='outer', indicator=True)

# Filter out rows that are present in both df1 and df2
out = merged[merged['_merge'] == 'left_only'][df1.columns]

Output:

  Full name Birth date     param1     param2    param10
2    Name 3      date3  something  something  something
3    Name 4      date4  something  something  something
eshirvana
  • Might be simpler to do `df1.merge(df2[['Full name', 'Birth date']], how='left', indicator=True)` then at the end, `.drop(columns='_merge')`. – wjandrea Aug 01 '23 at 18:00
  • With a full outer join, the last step also filters out rows that are only in `df2`, though there aren't any in this example. – wjandrea Aug 01 '23 at 18:02
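
To illustrate the point about the outer join, here is a minimal sketch with made-up data in which df2 contains a person ('Name 5') who is not in df1. The outer merge then produces a 'right_only' row, which the 'left_only' filter discards along with the matches.

```python
import pandas as pd

# Made-up sample data: 'Name 5' exists only in df2.
df1 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 3'],
    'Birth date': ['date1', 'date3'],
    'param1': ['a', 'b'],
})
df2 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 5'],
    'Birth date': ['date1', 'date5'],
    'param11': ['x', 'y'],
})

keys = ['Full name', 'Birth date']
merged = df1.merge(df2, on=keys, how='outer', indicator=True)

# One row per category: matched in both, only in df1, only in df2.
print(merged['_merge'].value_counts())

# Keeping 'left_only' drops both the matches and the df2-only row.
out = merged.loc[merged['_merge'] == 'left_only', df1.columns]
print(out)
```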
0

Short Answer

Note: If I understand your question correctly, the pandas snippet below should do this.

merged_df = pd.merge(df1, df2[['Full name', 'Birth date']], on=['Full name', 'Birth date'], how='left', indicator=True)
df1 = merged_df[merged_df['_merge'] == 'left_only'].drop(columns=['_merge'])

After this, df1 contains only the rows that had no matching entry in df2.

Additional details

The operation you're trying to perform is called an anti-join. In your case, you're trying to remove rows from df1 where there's a matching name and birthdate in df2.

  • First, you'd want to ensure that 'Full name' and 'Birth date' are of the same data type in both dataframes. This is necessary to ensure the merge operation works correctly.
    • If 'Birth date' is a string in both dataframes, there's no problem. But if it's a datetime type, you need to ensure both are in the same format.
  • To perform the anti-join, merge df1 with the key columns of df2 on 'Full name' and 'Birth date' using a left join, then keep only the rows that found no match in df2.
    • The indicator=True argument adds a column to the output DataFrame called _merge with information on the source of each row. The values are 'left_only', 'right_only', or 'both' depending on the source of the data. Rows with a '_merge' value of 'left_only' are those that were in df1 but not in df2, which is what you want.
  • After this operation, df1 will only contain rows that are not in df2.
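
The dtype point above can be sketched as follows, assuming (purely for illustration) that df1 stores 'Birth date' as strings while df2 stores it as datetimes. Converting both columns with `pd.to_datetime` before merging puts the keys on the same footing. The sample values are made up.

```python
import pandas as pd

# Made-up sample data: the same date stored as a string in df1 and as a datetime in df2.
df1 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 3'],
    'Birth date': ['1990-01-01', '1985-05-05'],
    'param1': ['a', 'b'],
})
df2 = pd.DataFrame({
    'Full name': ['Name 1'],
    'Birth date': pd.to_datetime(['1990-01-01']),
    'param11': ['x'],
})

keys = ['Full name', 'Birth date']

# Normalize the key column to datetimes in both frames before merging.
df1['Birth date'] = pd.to_datetime(df1['Birth date'])
df2['Birth date'] = pd.to_datetime(df2['Birth date'])

merged = df1.merge(df2[keys], on=keys, how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```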
wjandrea
Curious
  • *"If 'Birth date' is a string in both dataframes, there's no problem. But if it's a datetime type, you need to ensure both are in the same format."* -- I think you have those backwards, cause datetimes don't have format. – wjandrea Aug 01 '23 at 17:28
0

You could do a left join with indicator (see Pandas Merging 101) then select the rows that came from the left.

keys = ['Full name', 'Birth date']
left_only = (
    df1[keys].merge(df2[keys].drop_duplicates(), how='left', indicator=True)
    ['_merge'].eq('left_only')
    )
df1[left_only]
  Full name Birth date     param1     param2    param10
2    Name 3      date3  something  something  something
3    Name 4      date4  something  something  something

The .drop_duplicates() isn't necessary in this example, but I added it just in case.
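
A small sketch of why the deduplication can matter, using a made-up df2 that lists the same person twice: without it, the left merge returns one row per match, so the mask no longer has one entry per row of df1.

```python
import pandas as pd

keys = ['Full name', 'Birth date']

# Made-up sample data: df2 repeats the same (name, birth date) pair.
df1 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 3'],
    'Birth date': ['date1', 'date3'],
    'param1': ['a', 'b'],
})
df2 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 1'],
    'Birth date': ['date1', 'date1'],
    'param11': ['x', 'y'],
})

# Without deduplication, 'Name 1' matches twice and the merge grows.
raw = df1[keys].merge(df2[keys], how='left', indicator=True)

# With deduplication, the merge has exactly one row per row of df1.
deduped = df1[keys].merge(df2[keys].drop_duplicates(), how='left', indicator=True)

print(len(df1), len(raw), len(deduped))

# The deduplicated mask lines up with df1's default index.
mask = deduped['_merge'].eq('left_only')
result = df1[mask]
print(result)
```

Note that this relies on df1 having its default integer index, so that the mask's positions line up with df1's rows.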

wjandrea
0

As pointed out, if you have duplicates in your original DF this will drop them too - but since it's not indicated in your question I'll leave it up.

You could concat just the columns you want to check for dupes on and drop on the same set of columns:

out = (
    pd.concat([df1, df2[['Full name', 'Birth date']]])
      .drop_duplicates(subset=['Full name', 'Birth date'], keep=False)
)
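
To make the duplicates caveat concrete, here is a minimal sketch with made-up data in which df1 repeats a person ('Name 3') who does not appear in df2. Because `keep=False` drops every key that occurs more than once in the concatenated frame, both 'Name 3' rows are lost as well.

```python
import pandas as pd

keys = ['Full name', 'Birth date']

# Made-up sample data: 'Name 3' appears twice in df1 and never in df2.
df1 = pd.DataFrame({
    'Full name': ['Name 1', 'Name 3', 'Name 3', 'Name 4'],
    'Birth date': ['date1', 'date3', 'date3', 'date4'],
    'param1': ['a', 'b', 'c', 'd'],
})
df2 = pd.DataFrame({
    'Full name': ['Name 1'],
    'Birth date': ['date1'],
})

out = pd.concat([df1, df2[keys]]).drop_duplicates(subset=keys, keep=False)

# 'Name 1' is removed as intended, but so are both 'Name 3' rows.
print(out)
```

If df1 may contain repeated people, one of the merge-based approaches is likely a safer choice.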
Chris