Comparing two excel files by using Python

Question

I have data in two excel files like below

Sample DS Created:

df1 =  {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'], 'Count': [1, 1, 2]}
df1 = pd.DataFrame(df1, columns=df1.keys())

df2 =  {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'], 'Count': [2, 1, 2]}
df2 = pd.DataFrame(df2, columns=df2.keys())

please help me to get difference of both excels as below..

Transaction_name    Count_df1        Count_df2
SC-001_Homepage          1              2
SC-001_Homepage          1              1
SC-001_Homepage          2              2

First line of the output count is not matching. Will i be able to highlight in different color? Sample code is as below

#COmparing both excels
df1 = pd.read_csv(r"WLMOUTPUT.csv", dtype=object)
df2 = pd.read_csv(r"results.csv", dtype=object)

print('\n', df1)
print('\n',df2)

df1['Compare'] = df1['Transaction_Name'] + df1['Count'].astype(str)
df2['Compare'] = df2['Transaction_Name'] + df2['Count'].astype(str)

print('\n', df1.loc[~df1['Compare'].isin(df2['Compare'])])

Thanks in advance

Please find the work i had done so far to achieve the result #Formatting WLM data data = pd.read_excel(r"Script wise coordinates comparison_edited123.xlsx", sheet_name='WLM', dtype=object) data = pd.DataFrame(data, columns=data.keys()) df = pd.melt(data, id_vars=['Script_name'], value_name='Count') df['Transaction_Name'] = df['Script_name'] + '_' + df['variable'] Final_df = df[['Transaction_Name', 'Count']] Final_df.to_csv(r'WLMOUTPUT.csv', index=False) Code continues in the next comment to compare both CSV files — SG131712, Feb 04 '19 at 05:49
#COmparing both excels df1 = pd.read_csv(r"WLMOUTPUT.csv", dtype=object) df2 = pd.read_csv(r"results.csv", dtype=object) print('\n', df1) print('\n',df2) df1['Compare'] = df1['Transaction_Name'] + df1['Count'].astype(str) df2['Compare'] = df2['Transaction_Name'] + df2['Count'].astype(str) print('\n', df1.loc[~df1['Compare'].isin(df2['Compare'])]) — SG131712, Feb 04 '19 at 05:49
You have to provide it inside your question, you can edit it to adapt accordingly, also you need to format your text — Shailyn Ortiz, Feb 04 '19 at 06:16
@SwethaGorantla the reason no one has answered yet is there is too much info here. I would suggest you to just post 5-6 lines of the sample data which depicts what you are trying to acheive in a dataframe format so we can copy the data and replicate the issue along with 2 lines of explaination , your sample code (not the full, the useful bit only) and an expected output, just those. :) Check [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — anky, Feb 04 '19 at 07:31

score 1 · Accepted Answer · answered Feb 04 '19 at 09:42

1

You can use the merge function.

import pandas as pd

df1 = pd.DataFrame({'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'], 'Count': [1, 1, 2]}) 
df2 = pd.DataFrame({'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink'], 'Count': [2, 1, 2]})

merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'))

This will give you this DataFrame:

print(merged_df)

   Count_df1   Transaction_Name  Count_df2
0          1    SC-001_Homepage          2
1          1    SC-002_Homepage          1
2          2  SC-001_Signinlink          2

And then you can just use subsetting to see which rows have different counts:

diff = merged_df[merged_df['Count_df1'] != merged_df['Count_df2']]

And you will get this:

print(diff)

   Count_df1 Transaction_Name  Count_df2
0          1  SC-001_Homepage          2

answered Feb 04 '19 at 09:42

Luc Blassel

394
4
15

Thanks Luc Blassel, i could implement it and got required output. Can you also please help me if any of the transactions are missing how to capture it. For example in df1 there are 3 transactions and count of it, in df2 it has only 2 transactions and count of it. How to find that one transaction which is missing in df2? – SG131712 Feb 04 '19 at 10:04
You can specify the `outer` option when merging the 2 DataFrames: `merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'), how='outer')` an then when one of the transactions is missing in one of the DFs it will show us as `NaN` in `merged_df`. If this answered your question, please consider accepting it as the answer. – Luc Blassel Feb 04 '19 at 10:11
As you said it gives NaN at the missing transaction also it gives complete output. Is it possible to get only the missing Transaction name in df2 as below... Transaction_Name Count_df1 Count_df2 0 SC-001_AppLaunch_Signed 0 NaN – SG131712 Feb 04 '19 at 10:58
From [this answer](https://stackoverflow.com/a/14247708/8650928) you can get rows with only `NaN` using `merged_df[merged_df.isnull().any(axis=1)]` – Luc Blassel Feb 04 '19 at 11:02

Comparing two excel files by using Python

Sample DS Created:

1 Answers1