Compare df's including detailed insight in data

Question

I'm having a python project:

df_testR with columns={'Name', 'City','Licence', 'Amount'}

df_testF with columns={'Name', 'City','Licence', 'Amount'}

I want to compare both df's. Result should be a df, wehere I see the Name, City and Licence and the Amount. Normally, df_testR and df_testF should be exact same. In case it is not the same, I want to see the difference in Amount_R vs Amount_F.

I referred to: Diff between two dataframes in pandas

But I receive a table with TRUE and FALSE only:

Name	City	Licence	Amount
True	True	True	False

But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences between the data in the way such as:

Name	City	Licence	Amount_R	Amount_F
Paul	NY	YES	200	500.

Here, both tables contain PAUL, NY and Licence = Yes, but Table R contains 200 as Amount and table F contains 500 as amount. I want to receive a table from my analysis that captures only the lines where such differences occur.

Could someone help?

Are you looking for differences only in the Amount columns? Are differences in the other columns qualify as a difference? — SiP, Mar 10 '22 at 14:47
I am looking for differences in the Amount columns AND also for differences such as one entry does not at all exist in the other df — Baobab, Mar 10 '22 at 14:49

SiP · Accepted Answer · 2022-03-10T15:48:54.940

import copy
import pandas as pd

data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)

First find the missing rows and print them:

matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])

Then drop these rows and merge:

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)

Now remove the rows that have the same amount:

df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]

I assumed the DFs are sorted. If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.

Compare df's including detailed insight in data

1 Answers1