I have two Pandas Data Frames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1
and df2
. I'm interested in finding the rows of df1
which are not present in df2
. It is not necessary that any of the data frames would be the subset of the other. Also, the order of the rows does not matter.
For example, i
th observation in df1
may be j
th observation in df2
and I need to consider it as being present (order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work also for that).
A simple example of both data frame would be
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, NaN, 50})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, NaN, 13, 14, 50]})
in this case the solution would be
df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})
Please note that in reality, both data frames have 15 columns (exactly same columns names, exact same data type). Also, I'm using Python 2.7 on Jupyter Notebook on windows 7. I have used Pandas built in function df1.isin(df2)
but it does not provide the accurate results that I want.
Moreover, I have also seen this question
but this assumes that one data frame is the subset of another which is not necessarily true in my case.