21

I have two dataframes - df1 and df2.

df1 has row1,row2,row3,row4,row5
df2 has row2,row5

I want to have a new dataframe such that df1-df2. That is, the resultant dataframe should have rows as - row1,row3,row4.

user3243499
  • 2,953
  • 6
  • 33
  • 75

4 Answers4

19

You can use pandas.concat to concatenate the two dataframes rowwise, followed by drop_duplicates to remove all the duplicated rows in them.

In [1]: import pandas as pd
df_1 = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df_2 = pd.DataFrame({"A":["foo", "bar", "foo", "bar"], "B":[1,0,1,0], "C":["A","B","A","B"]})

In [2]: df = pd.concat([df_1, df_2])

In [3]: df
Out[3]: 
     A  B  C
0  foo  0  A
1  foo  1  A
2  foo  1  B
3  bar  1  A
0  foo  1  A
1  bar  0  B
2  foo  1  A
3  bar  0  B

In [4]: df.drop_duplicates(keep=False)
Out[4]: 
     A  B  C
0  foo  0  A
2  foo  1  B
3  bar  1  A
Nickil Maveli
  • 29,155
  • 8
  • 82
  • 85
11

You can use the index.difference() function

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(5, 2), index= ['row' + str(i) for i in range(1, 6)])
df1

        0             1
row1    0.249451    -0.107651
row2    1.295390    -1.773707
row3    -0.893647   -0.683306
row4    -1.090551   0.016833
row5    0.864612    0.369138

df2 = pd.DataFrame(np.random.randn(2, 2), index= ['row' + str(i) for i in [2, 5]])
df2

        0           1
row2    0.549396    -0.675574
row5    1.348785    0.942216

df1.loc[df1.index.difference(df2.index), ]

        0           1
row1    0.249451    -0.107651
row3    -0.893647   -0.683306
row4    -1.090551   0.016833
suku
  • 10,507
  • 16
  • 75
  • 120
Olivier Ma
  • 1,269
  • 13
  • 24
  • 1
    Note that this does not compare the contents of the two dataframes, it simply compares the values of the indexes. – prooffreader Dec 07 '17 at 21:44
5

This is the best way to do it:

df = df1.drop_duplicates().merge(df2.drop_duplicates(), on=df2.columns.to_list(), 
                   how='left', indicator=True)
df.loc[df._merge=='left_only',df.columns!='_merge']

Note that drop duplicated is used to minimize the comparisons. It would work without them as well.

Why this is the best way?

The best way is to compare the row contents themselves and not the index or one/two columns and same code can be used for other filters like 'both' and 'right_only' as well to achieve similar results. index.difference only works for unique index based comparisons

Hamza
  • 5,373
  • 3
  • 28
  • 43
-6

For these kind of questions, see left join in pandas.