1

I have two Pandas Data Frames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1 and df2 . I'm interested in finding the rows of df1 which are not present in df2. It is not necessary that any of the data frames would be the subset of the other. Also, the order of the rows does not matter.

For example, ith observation in df1 may be jth observation in df2 and I need to consider it as being present (order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work also for that).

A simple example of both data frame would be

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, NaN, 50}) 

df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, NaN, 13, 14, 50]})

in this case the solution would be

df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})

Please note that in reality, both data frames have 15 columns (exactly same columns names, exact same data type). Also, I'm using Python 2.7 on Jupyter Notebook on windows 7. I have used Pandas built in function df1.isin(df2) but it does not provide the accurate results that I want.

Moreover, I have also seen this question

but this assumes that one data frame is the subset of another which is not necessarily true in my case.

Avery
  • 2,270
  • 4
  • 33
  • 35
Sibghat Khan
  • 147
  • 1
  • 9

2 Answers2

2

Here's one way:

import pandas as pd, numpy as np

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})

x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}

pd.DataFrame(list(x), columns=['col1', 'col2'])

If you have np.nan data in your result, it'll come through as -1, but you can easily convert back. Assumes you won't have negative numbers in your underlying data [if so, replace by some impossible value].

The reason for the complication is that np.nan == np.nan is considered False.

jpp
  • 159,742
  • 34
  • 281
  • 339
  • May I ask why are you using tuple in Map, the reason for saying that is I have 15 columns in actual, not 2...So maybe this works for 2 columns but that's just a short example... – Sibghat Khan Feb 20 '18 at 17:20
  • @SibghatUllah, sure np. Feel free to accept if it helped (tick on left). – jpp Feb 20 '18 at 19:38
1

Here is on solution

pd.concat([df1,df2.loc[df2.col1.isin(df1.col1)]],keys=[1,2]).drop_duplicates(keep=False).loc[1]
Out[892]: 
   col1  col2
0     1  10.0
1     2  11.0
BENY
  • 317,841
  • 20
  • 164
  • 234