Diff of two Dataframes

Question

I need to compare two dataframes of different size row-wise and print out non matching rows. Lets take the following two:

df1 = DataFrame({
'Buyer': ['Carl', 'Carl', 'Carl'],
'Quantity': [18, 3, 5, ]})

df2 = DataFrame({
'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
'Quantity': [2, 1, 18, 5]})

What is the most efficient way to row-wise over df2 and print out rows not in df1 e.g.

Buyer     Quantity 
Carl         2
Mark         1

Important: I do not want to have row:

Buyer     Quantity 
Carl         3

Included in the diff:

I have already tried: Comparing two dataframes of different length row by row and adding columns for each row with equal value and Compare two DataFrames and output their differences side-by-side

But these do not match with my problem.

EdChum · Accepted Answer · 2016-04-27T15:23:32.190

134

merge the 2 dfs using method 'outer' and pass param indicator=True this will tell you whether the rows are present in both/left only/right only, you can then filter the merged df after:

In [22]:
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

Out[22]:
  Buyer  Quantity      _merge
3  Carl         2  right_only
4  Mark         1  right_only

edited Apr 27 '16 at 15:23

answered Apr 27 '16 at 15:03

EdChum

376,765
198
813
562

1

If a row appears once in one dataframe, and twice in the other - this method will not be able to detect that. Ideally, one of the entries should be marked as "right_only" and the other one as being in both. – Tobias Bergkvist Feb 16 '22 at 19:51
1

See @TobiasBergkvist answer below to account for duplicate rows – PJ_ Jun 24 '22 at 13:45

score 29 · Answer 2 · answered May 01 '19 at 10:19

29

you may find this as the best:

df2[ ~df2.isin(df1)].dropna()

answered May 01 '19 at 10:19

A. Nadjar

2,440
2
19
20

This doesn't work if the DataFrame has existing NAs. – csteel Feb 05 '23 at 14:56

score 11 · Answer 3 · answered Jun 21 '19 at 13:40

11

@EdChum's answer is self-explained. But using not 'both' condition makes more sense and you do not need to care about the order of comparison, and this is what a real diff supposed to be. For the sake of answering your question:

merged = df1.merge(df2, indicator=True, how='outer')
merged.loc = [merged['_merge'] != 'both']

answered Jun 21 '19 at 13:40

Shayan Amani

5,787
1
39
40

1

This is a full anti join, OP asked for right anti join – cs95 Jun 29 '20 at 17:53
3

The second line fails and should be swapped to: `merged.loc[merged['_merge'] != 'both']` – John Stud Oct 10 '21 at 16:02

score 9 · Answer 4 · answered Apr 27 '16 at 15:07

9

diff = set(zip(df2.Buyer, df2.Quantity)) - set(zip(df1.Buyer, df1.Quantity))

This is the first solution that came to mind. You can then put the diff set back in a DF for presentation.

answered Apr 27 '16 at 15:07

Shovalt

6,407
2
36
51

score 8 · Answer 5 · answered Sep 24 '20 at 10:52

8

As of Pandas 1.1.0, there is pandas.DataFrame.compare:

df1.compare(df2)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html

answered Sep 24 '20 at 10:52

gwelter

1,054
11
14

6

This doesn't work in inequal indices (aka, two dataframes with different number of rows) – Ali Naderi May 05 '22 at 09:56
1

Yep, from the notes in the doc page: _Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames_ – gwelter May 05 '22 at 18:24

score 4 · Answer 6 · answered Jul 15 '18 at 04:58

4

Try the following if you only care about adding the new Buyers to the other df:

df_delta=df2[df2['Buyer'].apply(lambda x: x not in df1['Buyer'].values)]

answered Jul 15 '18 at 04:58

Yasir

59
3

Tobias Bergkvist · Answer 7 · 2022-02-19T17:35:28.123

An important edge case

Consider the following, where you have an additional duplicate entry in the second dataframe. ('Carl', 5)

df1 = DataFrame({ 'Buyer':    ['Carl', 'Carl', 'Carl'],
                  'Quantity': [   18 ,     3 ,     5 ]  })

df2 = DataFrame({ 'Buyer':    ['Carl', 'Mark', 'Carl', 'Carl', 'Carl'],
                  'Quantity': [    2 ,     1 ,    18 ,     5 ,     5 ]  })

EdChum's answer will give you the following:

merged = df1.merge(df2, indicator=True, how='outer')
print(merged[merged['_merge'] == 'right_only'])

  Buyer  Quantity      _merge
4  Carl         2  right_only
5  Mark         1  right_only

As you can see, the solution ignores the additional duplicate value, which depending on what you are doing is something you want to avoid.

Here is a solution that more likely does what you want:

df1['duplicate_counter'] = df1.groupby(list(df1.columns)).cumcount()
df2['duplicate_counter'] = df2.groupby(list(df2.columns)).cumcount()
merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'right_only']

  Buyer  Quantity  duplicate_counter      _merge
3  Carl         2                  0  right_only
4  Mark         1                  0  right_only
5  Carl         5                  1  right_only

The duplicate counter ensures that every row is unique, which means that duplicate values are not removed. After merging, you can drop the duplicate_counter.

score 1 · Answer 8 · answered Mar 23 '23 at 11:47

There is datacompy e.g. as well. It allows to export some string based comparison report like this:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     6
1       new        4     5

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 1
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 5
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 7

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1    0.0500            0
1   float_fld        float64   float64          4    0.0005            3
2        name         object    object          2    0.0000            0

Sample Rows with Unequal Values
-------------------------------

       acct_id  dollar_amt (original)  dollar_amt (new)
0  10000001234                 123.45             123.4

       acct_id  float_fld (original)  float_fld (new)
0  10000001234            14530.1555        14530.155
5  10000001238                   NaN          111.000
2  10000001236                   NaN            1.000
1  10000001235                1.0000              NaN

       acct_id name (original)            name (new)
0  10000001234  George Maharis  George Michael Bluth
3  10000001237      Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

       acct_id  dollar_amt           name  float_fld    date_fld
4  10000001238        1.05  Lucille Bluth        NaN  2017-01-01

Diff of two Dataframes

8 Answers8

An important edge case

Linked

Related