50

I have two dataframes:

 df1 = row1;row2;row3
 df2 = row4;row5;row6;row2

I want my output dataframe to only contain the rows unique in df1, i.e.:

df_out = row1;row3

How do I get this most efficiently?

This code does what I want, but using 2 for-loops:

a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})

match_ident = []
for i in range(0,len(a)):
    found=False
    for j in range(0,len(b)):
        if a[0][i]==b[0][j]:
            if a[1][i]==b[1][j]:
                found=True
    match_ident.append(not(found))

a = a[match_ident]
RRC
  • 1,342
  • 2
  • 11
  • 17
  • Not a duplicate, since I don't have an unique id which maps to the common values in both dataframes. – RRC Jun 22 '17 at 18:14
  • Can't flag it, but https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe – victor Jun 22 '17 at 18:18

2 Answers2

78

You an use merge with parameter indicator and outer join, query for filtering and then remove helper column with drop:

DataFrames are joined on all columns, so on parameter can be omit.

print (pd.merge(a,b, indicator=True, how='outer')
         .query('_merge=="left_only"')
         .drop('_merge', axis=1))
   0   1
0  1  10
2  3  30
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • It would be better to use 'left' join in this case. If we're joining on an index or column(s) rather than full dataframe there might be missing values in other coluimns. If so, int values in such column would be converted to float thus loosing precision (imagine you have a long int index to database there). Using left join would guarantee that all columns in left df would not be converted. Also left join would preserve key order rather than sorting them lexicographically. – Emsi May 12 '22 at 15:56
19

You could convert a and b into Indexs, then use the Index.isin method to determine which rows are shared in common:

import pandas as pd
a = pd.DataFrame({0:[1,2,3],1:[10,20,30]})
b = pd.DataFrame({0:[0,1,2,3],1:[0,1,20,3]})

a_index = a.set_index([0,1]).index
b_index = b.set_index([0,1]).index
mask = ~a_index.isin(b_index)
result = a.loc[mask]
print(result)

yields

   0   1
0  1  10
2  3  30
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677