1

I am working on a program that would allow me to output differences of two csv files. Below is example with some dummy data.

Input:

import pandas as pd
df1 = pd.DataFrame([['vegetable', 100, 'orange', 12, 'carrot'],
                  ['fruit', 150, 'green', 10, 'apple'],
                  ['vegetable', 175, 'white', 3, 'colifower'],
                  ['fruit', 210, 'orange', 1, 'tangerine'],
                  ['vegetable', 75, 'black', 120, 'old_carrot']],
                  columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])

df2 = pd.DataFrame([['vegetable', 100, 'orange', 12, 'carrot'],
                  ['fruit', 150, 'green', 10, 'apple'],              
                  ['fruit', 200, 'red', 1, 'apple'],
                  ['vegetable', 500, 'green', 2, 'cabbage'],
                  ['vegetable', 170, 'white', 4, 'colifower'],
                  ['fruit', 210, 'orange', 1, 'tangerine'],
                  ['vegetable', 70, 'black', 122, 'very_old_carrot']],
                  columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])

Desired output:

extra_df = pd.DataFrame([['fruit', 200, 'red', 1, 'apple'],
                        ['vegetable', 500, 'green', 2, 'cabbage']],
                        columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])

changed_df = pd.DataFrame([['vegetable', '175 -> 170', 'white', '3 -> 4', 'coliflower'],
                         ['vegetable', '75 -> 70', 'black', '120 -> 122', 'carrot']],
                         columns=['type', 'weight', 'colour', 'days_in_basket', 'name'])

As we have no key (ID) problem is to match changed rows from two dataframes and distinct the rows that are completely new.

To overcome this I came up with following concept:
Define key choosing multiple columns (and raise exception if key is not unique), then sort rows using defined key, cut away excessive rows (number of rows in second dataframe is always higher) and compare dataframes. Defining a key is on a user side.

I have no clue how to use MultiIndex as a key or if MultiIndex is the right way to achieve my goal. Maybe there is whole different approach ex. outer merge? I have looked at multiple solutions but none of them covers adding a key and sorting.

Any tips appreciated.

1 Answers1

1

First left outer join the two dataframe together.

>>> combined_df=df2.merge(df1,on=['type','colour','name'],how='left') 
>>> combined_df
        type  weight_x  colour  days_in_basket_x        name  weight_y  \
0  vegetable       100  orange                12      carrot     100   
1      fruit       150   green                10       apple     150  
2      fruit       200     red                 1       apple     NaN   
3  vegetable       500   green                 2     cabbage     NaN   
4  vegetable       170   white                 4   colifower     175   
5      fruit       210  orange                 1   tangerine     210   
6  vegetable        70   black               122  old_carrot      75   

   days_in_basket_y  
0              12 
1              10 
2             NaN  
3             NaN  
4               3  
5               1  
6             120

When The two dataframes are joined together , the extra rows in df2 will have Nan in the columns from df1. So, we need to filter those rows with Nan to get the extra_df .

>>> extra_df=df2[combined_df['weight_y'].isnull()].reset_index(drop='True')
>>> extra_df
        type  weight colour  days_in_basket     name
0      fruit     200    red               1    apple
1  vegetable     500  green               2  cabbage

To get the change_df, first the extra rows needed to be removed.Also, rows without any change are needed to be removed.

>>> no_extra_df=combined_df[~combined_df['weight_y'].isnull()]
>>> changed_df_temp=no_extra_df[(no_extra_df['weight_x']!=no_extra_df['weight_y']) & (no_extra_df['days_in_basket_x']!=no_extra_df['days_in_basket_y'])].reset_index(drop=True)
>>> changed_df_temp
        type  weight_x colour  days_in_basket_x        name  weight_y  \
0  vegetable       170  white                 4   colifower     175   
1  vegetable        70  black               122  old_carrot      75   

   days_in_basket_y  
0               3  
1             120

Finally the columns weight_x,weight_y and days_in_basket_x,days_in_basket_y should be joined with ->

>>> change_df_temp['weight']=change_df_temp.apply(lambda x : str(x['weight_y'])+'->'+str(x['weight_x']),axis=1)
>>> change_df_temp['days_in_basket']=change_df_temp.apply(lambda x : str(x['days_in_basket_y'])+'->'+str(x['days_in_basket_x']),axis=1)
>>> 
>>> change_df=change_df_temp[['type', 'weight', 'colour', 'days_in_basket', 'name']]
>>> change_df
        type      weight colour days_in_basket        name
0  vegetable    175->170  white           3->4   colifower
1  vegetable      75->70  black       120->122  old_carrot
Sam17
  • 91
  • 5
  • Thank you, this resolution looks really good. What concerns me me is the key and sorting I mentioned in my question. In your answer `on=['type','colour','name']` is responsible for choosing _the key_. In my example of output I made a mistake and you probably changed `very_old_carrot` to `old_carrot` to get the right answer. But let's assume that we have inputs as they were defined at the beginning. Using 'type', 'colour' and 'name' we will loose change of `old_carrot` name. How to detect this situation and rise an exception? – Leszek Bulawa Nov 06 '17 at 09:12
  • This answer works only for provided 'dummy' values. I would like to use this program for different csv files / dataframes. – Leszek Bulawa Nov 06 '17 at 12:13