Find the excess rows between two dataframes

Question

Suppose I have 2 dataframes:

df1 = pd.DataFrame({
    'a': [0, 0, 0, 1, 1, 1, 1], 
    'b': [0, 0, 1, 1, 1, 1, 1], 
})

df2 = pd.DataFrame({
    'a': [0, 0, 0, 1, 1], 
    'b': [0, 0, 0, 1, 1], 
})

I want to compare both these data frames and find all the extra rows in df1 that are not in df2.

The desired output should be like this:

a	b
0	1
1	1
1	1

I have tried merge but this creates extra results since there are duplicates and I don't want to remove them.

Is there a good way of approaching this?

Mykola Zotko · Accepted Answer · 2021-05-31T11:08:56.613

0

As the first step, you can find the difference between row counts in each table:

diff = df1.value_counts().sub(df2.value_counts(), fill_value=0)

Result:

a  b
0  0   -1.0
   1    1.0
1  1    2.0
dtype: float64

Based on the previous result you can build a new dataframe with a difference:

lst = []
for (a, b), num in diff[diff > 0].iteritems():
    lst.extend([{'a': a, 'b': b}] * int(num))
    
pd.DataFrame(lst)

Output:

edited May 31 '21 at 11:08

answered May 31 '21 at 10:45

Mykola Zotko

15,583
3
71
73

Hi Mkyola, thanks for the help. I just noticed that the output should also include (0,1) since that is not included in df2. Is there an adjustment to your code that would cater for this? – MAD_DOG May 31 '21 at 11:12

Find the excess rows between two dataframes

1 Answers1