Check which rows of pandas exist in another

Question

I have two Pandas Data Frames of different sizes (at least 500,000 rows in both of them). For simplicity, you can call them df1 and df2 . I'm interested in finding the rows of df1 which are not present in df2. It is not necessary that any of the data frames would be the subset of the other. Also, the order of the rows does not matter.

For example, ith observation in df1 may be jth observation in df2 and I need to consider it as being present (order won't matter). Another important thing is that both data frames may contain null values (so the operation has to work also for that).

A simple example of both data frame would be

df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, NaN, 50}) 

df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, NaN, 13, 14, 50]})

in this case the solution would be

df3 = pandas.DataFrame(data = {'col1' : [1, 2 ], 'col2' : [10, 11]})

Please note that in reality, both data frames have 15 columns (exactly same columns names, exact same data type). Also, I'm using Python 2.7 on Jupyter Notebook on windows 7. I have used Pandas built in function df1.isin(df2) but it does not provide the accurate results that I want.

Moreover, I have also seen this question

but this assumes that one data frame is the subset of another which is not necessarily true in my case.

Yes Sir it is..The exact data type is float64 for all the columns of both data frames. — Sibghat Khan, Feb 20 '18 at 17:01
In your example output, where does the [10, 11] come from for col2 - it looks like 10 and 11 are in df1, but not df2. — sjw, Feb 20 '18 at 17:02
The output should be the first two rows of df1 because they are not present in df2. The next two rows of df1 are present in df2. — Sibghat Khan, Feb 20 '18 at 17:08

jpp · Accepted Answer · 2018-02-20T17:18:16.927

2

Here's one way:

import pandas as pd, numpy as np

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})

x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}

pd.DataFrame(list(x), columns=['col1', 'col2'])

If you have np.nan data in your result, it'll come through as -1, but you can easily convert back. Assumes you won't have negative numbers in your underlying data [if so, replace by some impossible value].

The reason for the complication is that np.nan == np.nan is considered False.

edited Feb 20 '18 at 17:18

answered Feb 20 '18 at 17:05

jpp

159,742
34
281
339

May I ask why are you using tuple in Map, the reason for saying that is I have 15 columns in actual, not 2...So maybe this works for 2 columns but that's just a short example... – Sibghat Khan Feb 20 '18 at 17:20
@SibghatUllah, sure np. Feel free to accept if it helped (tick on left). – jpp Feb 20 '18 at 19:38

score 1 · Answer 2 · answered Feb 20 '18 at 17:04

1

Here is on solution

pd.concat([df1,df2.loc[df2.col1.isin(df1.col1)]],keys=[1,2]).drop_duplicates(keep=False).loc[1]
Out[892]: 
   col1  col2
0     1  10.0
1     2  11.0

answered Feb 20 '18 at 17:04

BENY

317,841
20
164
234

nice, better than mine :) +1. I wonder how it works given `np.nan` == `np.nan` is False. – jpp Feb 20 '18 at 17:20
1

@jpp Guess drop_duplicate will skipna and check all real object – BENY Feb 20 '18 at 17:23

Check which rows of pandas exist in another

2 Answers2