Difference between two Pandas DataFrames to show how things has changed using merge

Question

I have two DataFrames like these:

In [1]: old_data = {"name": ['a', 'b', 'c', 'd'], "price": [1, 2, 3, 4], "instock": ['yes', 'yes', 'yes', 'no']}                                                      

In [2]: new_data = {"name": ['a', 'b', 'd', 'e'], "price": [1, 2, 3, 4], "instock": ['yes', 'yes', 'yes', 'no']}                                                      

In [3]: columns = ["name", "price", "instock"]                                                                                                                        

In [4]: old_df = pd.DataFrame(old_data, columns=columns)                                                                                                              

In [5]: new_df = pd.DataFrame(new_data, columns=columns)                                                                                                              

In [6]: old_df.head()                                                                                                                                                 
Out[6]: 
  name  price instock
0    a      1     yes
1    b      2     yes
2    c      3     yes
3    d      4      no

In [7]: new_df.head()                                                                                                                                                 
Out[7]: 
  name  price instock
0    a      1     yes
1    b      2     yes
2    d      3     yes
3    e      4      no

Note that the column name is unique.

Now if someone asks me how this data has changed (from old to new), I would say:

Product with the name of c no longer exists.
The price of product d has changed from 4 to 3 and it's now available in stock.
We have a new product with name of e and these details about it.

Now I want to do these automatically. From this question and answer, I can see that I can use merge. Something like this:

In [8]: old_df.merge(new_df, on="name", indicator=True, how="outer")                                                                                                  
Out[8]: 
  name  price_x instock_x  price_y instock_y      _merge
0    a      1.0       yes           1.0       yes               both
1    b      2.0       yes           2.0       yes               both
2    c      3.0       yes           NaN     NaN             left_only
3    d      4.0       no            3.0       yes               both
4    e      NaN     NaN         4.0       no                right_only

Now all the changes are visible, and I can write the report I wanted. But the records that hasn't changed (first two row) are also here. I want to exclude them. I can obviously do something like this:

In [9]: has_changed = lambda x: (x["price_x"] != x["price_y"]) & (x["instock_x"] != x["instock_y"])                                                                   

In [10]: old_df.merge(new_df, on="name", indicator=True, how="outer").loc[has_changed]                                                                                 
Out[10]: 
  name  price_x instock_x  price_y instock_y      _merge
2    c      3.0       yes           NaN      NaN             left_only
3    d      4.0       no            3.0        yes               both
4    e      NaN     NaN         4.0        no                right_only

But this way seems hacky. And in reality there are more columns (13, in my case, with long names), and writing all of them like this doesn't feel right. How can I do this?

score 1 · Accepted Answer · answered May 12 '20 at 13:06

We can solve this by alligning on indices before we merge and check which values are the same for all columns (axis=1).

Then we merge and filter out these rows that did not change:

m = ~old_df.set_index('name').eq(new_df.set_index('name')).all(axis=1)

df = (old_df.merge(new_df, 
                   on="name", 
                   how="outer",
                   suffixes=['_old', '_new'])
      .set_index('name')[m]
      .reset_index()
     )

  name  price_old instock_old  price_new instock_new
0    c        3.0         yes        NaN         NaN
1    d        4.0          no        3.0         yes
2    e        NaN         NaN        4.0          no

maz · Answer 2 · 2020-05-12T12:52:38.463

Use the product name column as index

new_df.set_index("name", inplace=True)
old_df.set_index("name", inplace=True)

Join old and new data to a new DataFrame on the product name

df = old_df.join(new_df, lsuffix="_old", rsuffix="_new")

      price_old instock_old  price_new instock_new
name                                              
a             1         yes        1.0         yes
b             2         yes        2.0         yes
c             3         yes        NaN         NaN
d             4          no        3.0         yes

Do your own calculations to find out which properties have changed, for example

df["price_drop"]  = df["price_old"]-df["price_new"]
df["is_new_instock"]  = (df["instock_old"]=="no")&(df["instock_new"]=="yes")

      price_old instock_old  price_new instock_new  price_drop  is_new_instock
name                                                                          
a             1         yes        1.0         yes         0.0           False
b             2         yes        2.0         yes         0.0           False
c             3         yes        NaN         NaN         NaN           False
d             4          no        3.0         yes         1.0            True

Difference between two Pandas DataFrames to show how things has changed using merge

2 Answers2