How to compare a pandas dataframe with an updated version of it?

Question

I have a data frame and an updated version of the same data frame with the colour in row 1 changed, row 2 deleted and one row appended.

I would like to append rows with a new ID to the old data frame , then compare both data frames and write the comparison result (e. g. "New entry", "Updated colour", "Entry deleted" or "No changes") into a "Compare" column.

   Name  Colour      ID Compare
0  Lisa     Red   Apple        
1  Anna    Blue  Banana        
2  Anna  Yellow  Orange        
3   Max   Green    Pear   


    Name  Colour      ID
0   Lisa  Purple   Apple
1   Anna  Yellow  Orange
2  Peter    Pink   Grape

I have tried several approaches with .iloc and .where, however I am too unexperienced with selection/manipulation, so it didn't work out. Here is what I am trying to achieve:

   Name  Colour      ID         Compare
0  Lisa     Red   Apple  Colour changed
1  Anna    Blue  Banana   Entry deleted
2  Anna  Yellow  Orange      No changes
3   Max   Green    Pear       New entry

I'm thankful for any help.

This is to create the data frames:

import pandas as pd

data = {'Name': ['Lisa', 'Anna', 'Anna', 'Max'],
        'Colour': ['Red', 'Blue', 'Yellow', 'Green'],
        'ID': ['Apple', 'Banana', 'Orange', 'Pear'],
        'Compare': ['','','','']}

df = pd.DataFrame(data, columns = ['Name', 'Colour', 'ID', 'Compare'])

updatedDf = df.copy()
updatedDf = updatedDf.iloc[:, :-1]
updatedDf.set_value(0, 'Colour', 'Purple')
updatedDf = updatedDf.drop(1)
newrow = ['Peter', 'Pink', 'Grape']
updatedDf.loc[len(updatedDf)] = newrow
updatedDf = updatedDf.reset_index(drop=True)

Have a look at this question/answer: https://stackoverflow.com/questions/44752968/using-lambda-function-pandas-to-set-column-values — ypnos, Sep 26 '19 at 09:47

maow · Accepted Answer · 2019-09-26T12:39:11.043

You can create a DataFrame with the old and new version side by side with an outer join

>> df2 = pd.merge(df, updatedDf, on ='ID', how='outer', suffixes=['', '_update'])
   Name  Colour      ID Compare Name_update Colour_update
0  Lisa     Red   Apple                Lisa        Purple
1  Anna    Blue  Banana                 NaN           NaN
2  Anna  Yellow  Orange                Anna        Yellow
3   Max   Green    Pear                 NaN           NaN
4   NaN     NaN   Grape     NaN       Peter          Pink

Now you need to define how to recognize each of your cases:

If Name and Colour are NaN, the entry is new
If Name_updated and Colour_updated are NaN the entry was deleted
If Name and Colour are both equal Name_updated and Colour_updated, the entry was not changed
If Name and Name_updated are equal but Color and Colour_updated are not, the colour was changed
Vice versa if name changed
If Name and Colour are both not equal Name_updated and Colour_updated, you did not define the expected behaviour yet

Beware that this does not take care of edge cases e.g. when only one field was deleted or IDs are not unique

You can encapsulate all that conditionals into a function and use apply or just do it by copy-paste like so

df2.loc[df2[['Name', 'Colour']].isnull().any(axis=1), 'Compare'] = 'New entry'
df2.loc[df2[['Name_updated', 'Colour_updated']].isnull().any(axis=1), 'Compare'] = 'Entry deleted'
df2.loc[(~df2[['Name', 'Colour', 'Name_updated', 'Colour_updated']].isnull().any(axis=1)) & (df2['Name'] == df2['Name_updated']) & (df2['Colour'] == df2['Colour_updated']), 'Compare'] = 'No changes'
df2.loc[(~df2[['Name', 'Colour', 'Name_updated', 'Colour_updated']].isnull().any(axis=1)) & (df2['Name'] == df2['Name_updated']) & (df2['Colour'] != df2['Colour_updated']), 'Compare'] = 'Colour changed'
df2.loc[(~df2[['Name', 'Colour', 'Name_updated', 'Colour_updated']].isnull().any(axis=1)) & (df2['Name'] != df2['Name_updated']) & (df2['Colour'] == df2['Colour_updated']), 'Compare'] = 'Name changed'
df2.loc[(~df2[['Name', 'Colour', 'Name_updated', 'Colour_updated']].isnull().any(axis=1)) & (df2['Name'] != df2['Name_updated']) & (df2['Colour'] != df2['Colour_updated']), 'Compare'] = 'Name and colour changed'

Although it is slightly complicated, the check that none of the entries is NaN is necessary in the last 4 statements. Comparison with NaN are always true, so it is a bit safer this way.

Take the new values whereever you can

df2['Name'].update(df2['Name_updated'])                                                                                                                                                                   
df2['Colour'].update(df2['Colour_updated'])

The Series.update method used here skips the lines where values where deleted automatically.

In the end you may throw away the temporary columns.

df2.drop(['Name_updated', 'Colour_updated'], axis=1, inplace=True)                                                                                                                                                      

    Name  Colour      ID         Compare
0   Lisa  Purple   Apple  Colour changed
1   Anna    Blue  Banana   Entry deleted
2   Anna  Yellow  Orange      No changes
3    Max   Green    Pear   Entry deleted
4  Peter    Pink   Grape       New entry

How to compare a pandas dataframe with an updated version of it?

1 Answers1