
I have an existing dataframe and a new dataframe whose rows may already be present in the existing frame, but which may also contain new rows. I have struggled to find a reliable way to drop the already-existing rows from the new dataframe by comparing it with the existing one.

I've done my homework. The solution seems to be to use isin(). However, I find that this has hidden dangers. In particular:

pandas get rows which are NOT in other dataframe

Pandas cannot compute isin with a duplicate axis

Pandas promotes int to float when filtering

Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case which doesn't capture the corner cases is shown below. Note that I want to remove rows in new that are in existing, so that new only contains rows not in existing. The simpler problem of updating existing with the new rows from new can be achieved with pd.merge() + DataFrame.drop_duplicates()
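For reference, the simpler "update" problem mentioned above can be sketched like this (using pd.concat + drop_duplicates, one of several equivalent routes; the frame names here are illustrative):

```python
import pandas as pd

existing = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})
new = pd.DataFrame({'col1': [3, 4], 'col2': [12, 13]})

# Stack both frames, then keep each full row once (existing rows come first)
updated = (pd.concat([existing, new], ignore_index=True)
             .drop_duplicates()
             .reset_index(drop=True))
print(updated)
```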

In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})  
    ...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})                                                                                             

In [54]: df1                                                                                                                                                                
Out[54]: 
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

In [55]: df2                                                                                                                                                                
Out[55]: 
   col1  col2
0     1    10
1     2    11
2     3    12

In [56]: df1[~df1.isin(df2)]                                                                                                                                                
Out[56]: 
   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   4.0  13.0
4   5.0  14.0

In [57]: df1[~df1.isin(df2)].dropna()                                                                                                                                       
Out[57]: 
   col1  col2
3   4.0  13.0
4   5.0  14.0
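To make the int-to-float promotion concrete: the boolean-mask indexing above introduces NaN for the non-matching cells, which forces the columns to float. A sketch of one way to restore the original dtypes afterwards (assuming no genuinely missing values remain after dropna):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

out = df1[~df1.isin(df2)].dropna()
print(out.dtypes)  # both columns have been promoted to float64

# Cast back to the original dtypes explicitly
out = out.astype(df1.dtypes)
print(out.dtypes)  # back to int64
```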
s5s
  • Possible duplicate of [Anti-Join Pandas](https://stackoverflow.com/questions/38516664/anti-join-pandas) – konvas Nov 01 '19 at 16:59
  • `df1[~df1.isin(df2).all(1)]` ? – anky Nov 01 '19 at 17:01
  • @konvas I'm not sure - I can use merge + drop_duplicates(). However, I need a way to also leave just the new rows in `new` - for example, I might want to persist those. So it's not just about updating `existing` - it's also about leaving `new` with just the new rows that are not in `existing` – s5s Nov 01 '19 at 17:02
  • @anky_91 how is this different from `df1[~df1.isin(df2)].dropna()` – s5s Nov 01 '19 at 17:03
  • Sorry if I didn't get the question; if you don't want the sequence of rows to be considered, the answer below is what you want – anky Nov 01 '19 at 17:27
  • related : [Pandas analogue to SQL MINUS / EXCEPT operator, using multiple columns](https://stackoverflow.com/questions/56235226/pandas-analogue-to-sql-minus-except-operator-using-multiple-columns/56235551#56235551) – anky Nov 01 '19 at 17:40

2 Answers


We can use DataFrame.merge with indicator=True, then DataFrame.query and DataFrame.drop:

df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)

   col1  col2
3     4    13
4     5    14

If, for example, we now change a value in row 0:

df1.iat[0, 0] = 3

row 0 is no longer filtered out:

df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)

   col1  col2
0     3    10
3     4    13
4     5    14

Step by step

df_filtered = df1.merge(df2, how='outer', indicator=True)
print(df_filtered)
   col1  col2      _merge
0     3    10   left_only
1     2    11        both
2     3    12        both
3     4    13   left_only
4     5    14   left_only
5     1    10  right_only

df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'"))
print(df_filtered)
   col1  col2     _merge
0     3    10  left_only
3     4    13  left_only
4     5    14  left_only

df_filtered = (df1.merge(df2, how='outer', indicator=True)
                  .query("_merge == 'left_only'")
                  .drop('_merge', axis=1))
print(df_filtered)
   col1  col2
0     3    10
3     4    13
4     5    14
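The whole recipe can be wrapped in a small helper (the name anti_join is mine, not a pandas API):

```python
import pandas as pd

def anti_join(left, right):
    """Rows of `left` whose full-row values do not appear in `right`.
    (`anti_join` is a name chosen here, not a pandas function.)"""
    return (left.merge(right, how='outer', indicator=True)
                .query("_merge == 'left_only'")
                .drop(columns='_merge'))

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})
print(anti_join(df1, df2))
```

One caveat: because merge joins on all shared columns, duplicate rows in either frame can multiply rows in the output, so this is best applied to frames that are already deduplicated.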
ansev

You may try Series.isin. It is independent of the index, i.e. it only compares values. You just need to convert the columns of each dataframe into a Series of tuples to create the mask:

s1 = df1.agg(tuple, axis=1)
s2 = df2.agg(tuple, axis=1)

df1[~s1.isin(s2)]

Out[538]:
   col1  col2
3     4    13
4     5    14
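A self-contained version of this approach, with df2 given a different index to show that the comparison ignores it (the index values here are arbitrary):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
# A non-default index on df2, to demonstrate index independence
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]},
                   index=[100, 200, 300])

# One tuple per row, so whole rows are compared by value
s1 = df1.agg(tuple, axis=1)
s2 = df2.agg(tuple, axis=1)

result = df1[~s1.isin(s2)]
print(result)
```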
Andy L.