Finding the difference between 2 different sized dataframes with Pandas

Question

I have 2 CSV files that were created at different dates that I want to compare and show what remained the same and what has changed. I don’t know where or how to begin, because when I try different merges and joins I run into the issue of the dataframes not being the same size.

df1 :
I    ID            Status         
0   123            Active   
1   124            Active  
2   125            Inactive   
3   126            Active  
4   128            Inactive

df2: 
I    ID            Status         
0   123            Active   
1   124            Inactive 
2   125            Inactive   
3   126            Active  
4   128            Active
5   129            Active  
6   130            Active   
7   131            Active
8   132            Inactive

The goal is to highlight the status changes from df1 to df2 and remained constant from df1 to df2. Using the example above maybe I create 2 separate Dataframes that look something like this:

df3: (containing all new changes)
I    ID              Status           
1    124             Inactive  
4    128             Active 
5    129             Active  
6    130             Active   
7    131             Active

df4: (containing all other ‘Active’ one that remained consistent)
I    ID             Status         
0   123             Active     
3   126             Active

To explain the logic behind each row and why it is included in df3, I will go row by row, because I don’t know if my example is clear enough

df3:
Index 1 - active to inactive
Index 4 - inactive to active
Index 5 - new active row
Index 6 - new active row
Index 7 - new active row
Index 8 - new inactive row

df4:
Index 0 - remained constant
Index 2 - remained constant
Index 3 - remained constant

I don’t know how to approach this because with merge and join I come across the error that the dataframes need to be the same size. Basically, what I want to do find what changed and what stayed the same from df1 to df2. I have 2 sample datasets that I am working with, they have more statuses but the idea is the same. Here is a google sheet with both csv files, the updated_values would be df2 and the original_values would be df1.

You CAN merge different sized dfs, in fact that is the most common purpose of merging. I suggest you try `df_merged = df2.merge(df1, how='left', on='ID')` ...see [pandas merging 101](https://stackoverflow.com/q/53645882/6692898) — RichieV, Aug 10 '20 at 04:00
In addition to setting `how=` in your pandas merge, consider using `diff` (e.g. `diff file_old.csv file_new.csv`) on the command line to do some straightforward file-based exploration. — Josh, Aug 10 '20 at 04:30

rodrigomd · Accepted Answer · 2020-08-11T04:29:55.383

You need to perform a full outer join to get all the entries from both datasets. All the values of df2 that are not in df1 will be filled with NaN values.

df3 = pd.merge(left=df1,right=df2,on='ID',how='outer', indicator=True)

This new df will contain a column 'Status_x' with values of df1, and 'Status_y' with values of df2. Then you can simply create a new column called 'change' to store the changes. You could use boolean indexing to check which columns have changed:

new_rows = df3['_merge'] == 'right_only' # True if the IDs were not in df1
constant = df3['Status_x'] == df3['Status_y'] # True if the Status is the same for both Df

df3['change'] = df3['Status_x'] + ' to ' + df3['Status_y'] # String concatenation to show status change. E.g.: 'Active to Inactive'
df3.loc[new_rows,'change'] = 'New active row' #Sets the value for all new rows
df3.loc[constant,'change'] = 'Remained constant' #Sets the value for columns that remained constant

Finding the difference between 2 different sized dataframes with Pandas

1 Answers1