How to remove common rows in two dataframes in Pandas?

Question

I have two dataframes - df1 and df2.

df1 has row1,row2,row3,row4,row5
df2 has row2,row5

I want to have a new dataframe such that df1-df2. That is, the resultant dataframe should have rows as - row1,row3,row4.

score 19 · Accepted Answer · answered Jul 31 '16 at 07:24

You can use pandas.concat to concatenate the two dataframes rowwise, followed by drop_duplicates to remove all the duplicated rows in them.

In [1]: import pandas as pd
df_1 = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df_2 = pd.DataFrame({"A":["foo", "bar", "foo", "bar"], "B":[1,0,1,0], "C":["A","B","A","B"]})

In [2]: df = pd.concat([df_1, df_2])

In [3]: df
Out[3]: 
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
0  foo  1  A
1  bar  0  B
2  foo  1  A
3  bar  0  B

In [4]: df.drop_duplicates(keep=False)
Out[4]: 
     A  B  C
0  foo  0  A
2  foo  1  B
3  bar  1  A

score 11 · Answer 2 · edited May 07 '17 at 16:18

You can use the index.difference() function

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 2), index= ['row' + str(i) for i in range(1, 6)])
df1

        0             1
row1    0.249451    -0.107651
row2    1.295390    -1.773707
row3    -0.893647   -0.683306
row4    -1.090551   0.016833
row5    0.864612    0.369138

df2 = pd.DataFrame(np.random.randn(2, 2), index= ['row' + str(i) for i in [2, 5]])
df2

        0           1
row2    0.549396    -0.675574
row5    1.348785    0.942216

df1.loc[df1.index.difference(df2.index), ]

        0           1
row1    0.249451    -0.107651
row3    -0.893647   -0.683306
row4    -1.090551   0.016833

Note that this does not compare the contents of the two dataframes, it simply compares the values of the indexes. — prooffreader, Dec 07 '17 at 21:44

Hamza · Answer 3 · 2023-01-05T09:16:01.213

This is the best way to do it:

df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(), 
                   how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']

Note that drop duplicated is used to minimize the comparisons. It would work without them as well.

Why this is the best way?

The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. index.difference only works for unique index based comparisons

Manideep Karthik · Answer 4 · 2019-05-06T12:55:25.437

-6

For these kind of questions, see left join in pandas.

edited May 06 '19 at 12:55

answered Dec 04 '18 at 07:00

Manideep Karthik

99
9

1

First it's a left _join_. Second you should answer the question. This belongs in the comments. – Jason Strimpel Apr 27 '19 at 11:27
5

Thanks for the feedback. I am new to SO. – Manideep Karthik May 06 '19 at 12:56

How to remove common rows in two dataframes in Pandas?

4 Answers4

This is the best way to do it:

Why this is the best way?

Linked

Related