1

Suppose I have 2 dataframes:

df1 = pd.DataFrame({
    'a': [0, 0, 0, 1, 1, 1, 1], 
    'b': [0, 0, 1, 1, 1, 1, 1], 
})

df2 = pd.DataFrame({
    'a': [0, 0, 0, 1, 1], 
    'b': [0, 0, 0, 1, 1], 
})

I want to compare both these data frames and find all the extra rows in df1 that are not in df2.

The desired output should be like this:

a b
0 1
1 1
1 1

I have tried merge but this creates extra results since there are duplicates and I don't want to remove them.

Is there a good way of approaching this?

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
MAD_DOG
  • 13
  • 3

1 Answers1

0

As the first step, you can find the difference between row counts in each table:

diff = df1.value_counts().sub(df2.value_counts(), fill_value=0)

Result:

a  b
0  0   -1.0
   1    1.0
1  1    2.0
dtype: float64

Based on the previous result you can build a new dataframe with a difference:

lst = []
for (a, b), num in diff[diff > 0].iteritems():
    lst.extend([{'a': a, 'b': b}] * int(num))
    
pd.DataFrame(lst)

Output:

   a  b
0  0  1
1  1  1
2  1  1
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
  • Hi Mkyola, thanks for the help. I just noticed that the output should also include (0,1) since that is not included in df2. Is there an adjustment to your code that would cater for this? – MAD_DOG May 31 '21 at 11:12