I have two DataFrames like these:
In [1]: old_data = {"name": ['a', 'b', 'c', 'd'], "price": [1, 2, 3, 4], "instock": ['yes', 'yes', 'yes', 'no']}
In [2]: new_data = {"name": ['a', 'b', 'd', 'e'], "price": [1, 2, 3, 4], "instock": ['yes', 'yes', 'yes', 'no']}
In [3]: columns = ["name", "price", "instock"]
In [4]: old_df = pd.DataFrame(old_data, columns=columns)
In [5]: new_df = pd.DataFrame(new_data, columns=columns)
In [6]: old_df.head()
Out[6]:
name price instock
0 a 1 yes
1 b 2 yes
2 c 3 yes
3 d 4 no
In [7]: new_df.head()
Out[7]:
name price instock
0 a 1 yes
1 b 2 yes
2 d 3 yes
3 e 4 no
Note that the column `name` is unique.
Now if someone asks me how this data has changed (from old to new), I would say:
- The product with the name `c` no longer exists.
- The price of the product `d` has changed from 4 to 3, and it's now available in stock.
- We have a new product with the name `e` and these details about it.
Now I want to do this automatically. From this question and answer, I can see that I can use `merge`. Something like this:
In [8]: old_df.merge(new_df, on="name", indicator=True, how="outer")
Out[8]:
name price_x instock_x price_y instock_y _merge
0 a 1.0 yes 1.0 yes both
1 b 2.0 yes 2.0 yes both
2 c 3.0 yes NaN NaN left_only
3 d 4.0 no 3.0 yes both
4 e NaN NaN 4.0 no right_only
Now all the changes are visible, and I can write the report I wanted. But the records that haven't changed (the first two rows) are also here. I want to exclude them. I can obviously do something like this:
In [9]: has_changed = lambda x: (x["price_x"] != x["price_y"]) | (x["instock_x"] != x["instock_y"])
In [10]: old_df.merge(new_df, on="name", indicator=True, how="outer").loc[has_changed]
Out[10]:
name price_x instock_x price_y instock_y _merge
2 c 3.0 yes NaN NaN left_only
3 d 4.0 no 3.0 yes both
4 e NaN NaN 4.0 no right_only
But this way seems hacky. And in reality there are more columns (13, in my case, with long names), and writing all of them like this doesn't feel right. How can I do this?
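The closest I've got to avoiding a hand-written column list is building the mask in a loop over the non-key columns (`value_cols`, `changed`, and `report` are just names I picked), but I don't know whether this is the idiomatic way:

```python
import pandas as pd

old_df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                       "price": [1, 2, 3, 4],
                       "instock": ["yes", "yes", "yes", "no"]})
new_df = pd.DataFrame({"name": ["a", "b", "d", "e"],
                       "price": [1, 2, 3, 4],
                       "instock": ["yes", "yes", "yes", "no"]})

merged = old_df.merge(new_df, on="name", indicator=True, how="outer")

# Compare every column except the key, without naming each one.
value_cols = [c for c in old_df.columns if c != "name"]

# A row has changed if any compared column differs between the two sides.
# (NaN != NaN evaluates to True, so added/removed rows are picked up too.)
changed = pd.Series(False, index=merged.index)
for col in value_cols:
    changed |= merged[f"{col}_x"] != merged[f"{col}_y"]

report = merged[changed]
print(report)
```

This keeps only the `c`, `d`, and `e` rows, but it still feels like I'm reimplementing something pandas may already provide.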