I have a scenario where I have an `existing` dataframe and a `new` dataframe, which contains rows that might already be in the `existing` frame but might also contain new rows. I have struggled to find a reliable way to drop these existing rows from the `new` dataframe by comparing it with the `existing` dataframe.
I've done my homework. The solution seems to be to use `isin()`. However, I find that this has hidden dangers. In particular:
pandas get rows which are NOT in other dataframe
Pandas cannot compute isin with a duplicate axis
Pandas promotes int to float when filtering
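To illustrate one of these dangers, here is a minimal sketch of how `DataFrame.isin()` behaves when it is given another dataframe: it compares cell by cell and only where the index and column labels line up, so identical rows stored under different index labels are not detected. The frames and their values are hypothetical, chosen only for the demonstration.

```python
import pandas as pd

# Hypothetical frames for illustration only.
existing = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

# Same values as the first two rows of `existing`, but under different index labels.
new = pd.DataFrame({"col1": [1, 2], "col2": [10, 11]}, index=[5, 6])

# isin() with a DataFrame argument matches on index and column labels,
# not on row contents, so nothing is reported as present here.
mask = new.isin(existing)
print(mask)

# With matching index labels, the very same values are reported as present.
mask_aligned = new.reset_index(drop=True).isin(existing)
print(mask_aligned)
```

So a filter built on this mask depends on how the two frames happen to be indexed, not only on what the rows contain.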
Is there a way to reliably filter out rows from one dataframe based on membership/containment in another dataframe? A simple use case which doesn't capture corner cases is shown below. Note that I want to remove rows in `new` that are in `existing`, so that `new` only contains rows not in `existing`. A simpler problem, updating `existing` with new rows from `new`, can be achieved with `pd.merge()` + `DataFrame.drop_duplicates()`.
In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
In [54]: df1
Out[54]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
In [55]: df2
Out[55]:
col1 col2
0 1 10
1 2 11
2 3 12
In [56]: df1[~df1.isin(df2)]
Out[56]:
col1 col2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 4.0 13.0
4 5.0 14.0
In [57]: df1[~df1.isin(df2)].dropna()
Out[57]:
col1 col2
3 4.0 13.0
4 5.0 14.0
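For reference, the simpler update case mentioned earlier (`pd.merge()` + `DataFrame.drop_duplicates()`) could look roughly like the sketch below. The names `existing`, `new` and `updated` are assumptions for illustration, and the outer merge on all shared columns is just one way to combine the two calls.

```python
import pandas as pd

# Hypothetical frames for illustration only.
existing = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})
new = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})

# With no `on` argument, merge joins on every column the frames share.
# An outer join keeps rows found in either frame; drop_duplicates() then
# removes any fully repeated rows that remain.
updated = (
    pd.merge(existing, new, how="outer")
    .drop_duplicates()
    .reset_index(drop=True)
)
print(updated)
```

This only covers the union of the two frames. It does not solve the actual problem of leaving `new` with just the rows that are absent from `existing`.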