I am working on a program that would allow me to output differences of two csv files. Below is example with some dummy data.
Input:
import pandas as pd
df1 = pd.DataFrame([['vegetable', 100, 'orange', 12, 'carrot'],
['fruit', 150, 'green', 10, 'apple'],
['vegetable', 175, 'white', 3, 'colifower'],
['fruit', 210, 'orange', 1, 'tangerine'],
['vegetable', 75, 'black', 120, 'old_carrot']],
columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])
df2 = pd.DataFrame([['vegetable', 100, 'orange', 12, 'carrot'],
['fruit', 150, 'green', 10, 'apple'],
['fruit', 200, 'red', 1, 'apple'],
['vegetable', 500, 'green', 2, 'cabbage'],
['vegetable', 170, 'white', 4, 'colifower'],
['fruit', 210, 'orange', 1, 'tangerine'],
['vegetable', 70, 'black', 122, 'very_old_carrot']],
columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])
Desired output:
extra_df = pd.DataFrame([['fruit', 200, 'red', 1, 'apple'],
['vegetable', 500, 'green', 2, 'cabbage']],
columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])
changed_df = pd.DataFrame([['vegetable', '175 -> 170', 'white', '3 -> 4', 'coliflower'],
['vegetable', '75 -> 70', 'black', '120 -> 122', 'carrot']],
columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])
As we have no key (ID) problem is to match changed rows from two dataframes and distinct the rows that are completely new.
To overcome this I came up with following concept:
Define key choosing multiple columns (and raise exception if key is not unique), then sort rows using defined key, cut away excessive rows (number of rows in second dataframe is always higher) and compare dataframes.
Defining a key is on a user side.
I have no clue how to use MultiIndex as a key or if MultiIndex is the right way to achieve my goal. Maybe there is whole different approach ex. outer merge? I have looked at multiple solutions but none of them covers adding a key and sorting.
Any tips appreciated.