How to remove rows in a Pandas dataframe if the same row exists in another dataframe?

Question

I have two dataframes:

 df1 = row1;row2;row3
 df2 = row4;row5;row6;row2

I want my output dataframe to only contain the rows unique in df1, i.e.:

df_out = row1;row3

How do I get this most efficiently?

This code does what I want, but using 2 for-loops:

a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})

match_ident = []
for i in range(0,len(a)):
    found=False
    for j in range(0,len(b)):
        if a[0][i]==b[0][j]:
            if a[1][i]==b[1][j]:
                found=True
    match_ident.append(not(found))

a = a[match_ident]

Not a duplicate, since I don't have an unique id which maps to the common values in both dataframes. — RRC, Jun 22 '17 at 18:14
Can't flag it, but https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe — victor, Jun 22 '17 at 18:18

score 78 · Accepted Answer · answered Jun 22 '17 at 18:26

78

You an use merge with parameter indicator and outer join, query for filtering and then remove helper column with drop:

DataFrames are joined on all columns, so on parameter can be omit.

print (pd.merge(a,b, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
   0   1
0  1  10
2  3  30

answered Jun 22 '17 at 18:26

jezrael

822,522
95
1,334
1,252

It would be better to use 'left' join in this case. If we're joining on an index or column(s) rather than full dataframe there might be missing values in other coluimns. If so, int values in such column would be converted to float thus loosing precision (imagine you have a long int index to database there). Using left join would guarantee that all columns in left df would not be converted. Also left join would preserve key order rather than sorting them lexicographically. – Emsi May 12 '22 at 15:56

score 19 · Answer 2 · answered Jun 22 '17 at 18:33

You could convert a and b into Indexs, then use the Index.isin method to determine which rows are shared in common:

import pandas as pd
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})

a_index = a.set_index([0,1]).index
b_index = b.set_index([0,1]).index
mask = ~a_index.isin(b_index)
result = a.loc[mask]
print(result)

yields

   0   1
0  1  10
2  3  30

How to remove rows in a Pandas dataframe if the same row exists in another dataframe?

2 Answers2

Linked