0

I'm having a python project:

df_testR with columns={'Name', 'City','Licence', 'Amount'}

df_testF with columns={'Name', 'City','Licence', 'Amount'}

I want to compare both df's. Result should be a df, wehere I see the Name, City and Licence and the Amount. Normally, df_testR and df_testF should be exact same. In case it is not the same, I want to see the difference in Amount_R vs Amount_F.

I referred to: Diff between two dataframes in pandas

But I receive a table with TRUE and FALSE only:

Name City Licence Amount
True True True False

But I'd like to get a table that lists ONLY the lines where differences occur, and that shows the differences between the data in the way such as:

Name City Licence Amount_R Amount_F
Paul NY YES 200 500.

Here, both tables contain PAUL, NY and Licence = Yes, but Table R contains 200 as Amount and table F contains 500 as amount. I want to receive a table from my analysis that captures only the lines where such differences occur.

Could someone help?

Baobab
  • 55
  • 1
  • 8
  • Are you looking for differences only in the Amount columns? Are differences in the other columns qualify as a difference? – SiP Mar 10 '22 at 14:47
  • I am looking for differences in the Amount columns AND also for differences such as one entry does not at all exist in the other df – Baobab Mar 10 '22 at 14:49

1 Answers1

0
import copy
import pandas as pd

data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)

First find the missing rows and print them:

matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns]
metadata_match['check'] = metadata_match.apply(all, 1, raw=True)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])

Then drop these rows and merge:

df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)

Now remove the rows that have the same amount:

df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]

I assumed the DFs are sorted. If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.

SiP
  • 1,080
  • 3
  • 8