I've been trying to compare 2 columns with similar entries from 2 different excel files

Question

I'm new to pandas and I'm trying to check entries from 2 different excel sheets.

Df1-

          Date   Value Date       Narration  Cod  Debit     Credit  Balance
0  02-Apr-2021  02-Apr-2021  BY XYZ company  TRF      0    1112.00 -3244213
1  02-Apr-2022  02-Apr-2022  BY XYZ company  CLR   2424       0.00 -3244212
2  02-Apr-2023  02-Apr-2023  BY XYZ company  TRF      0    9894.00 -3244211
3  02-Apr-2024  02-Apr-2024  BY XYZ company  TRF  32234  130000.03 -3244210
4  02-Apr-2025  02-Apr-2025  BY XYZ company  TRF      0   45435.00 -3244209
5  02-Apr-2026  02-Apr-2026  BY XYZ company  CLR    983       0.00 -3244208

Df2 -

           Date Unnamed: 1       Company Name Vch Type  Debit     Credit
0 2021-04-01   TEAM XYZ  QWERTY123 (21-23)  Receipt      0   45435.00
1 2021-04-02   TEAM XYZ  QWERTY123 (21-24)  Payment  32234       0.00
2 2021-04-03   TEAM XYZ  QWERTY123 (21-25)  Receipt      0    9894.00
3 2021-04-04   TEAM XYZ  QWERTY123 (21-26)  Payment   2424  130000.03
4 2021-04-05   TEAM XYZ  QWERTY123 (21-27)  Receipt      0    1112.00

Here's what I'm trying to do. Both files have similar entries and I need to get the value that's not in either file. For ex: the value 983 (df1) has no match in df2. I want 983 in the output. Also,there can be n entries of the same value , for ex : there can be n entries of the value 9894.00 in debit column of df1 and it has to find the same value in df2. If there are 10 entries of 9894.00 in file 1 there has to be 10 entries of 9894.00 in df2 as well.

Here's what I've tried so far :

import pandas as pd
df1 = pd.read_excel(file1)
df2 = pd.read_excel(file2)
report = pd.merge(df1,df2)
print(report)

But this doesn't really get the expected output.

@mozway df1 - {'Date': '02-Apr-2021', 'Desc': 'CC bill', 'Debit': 0.0, 'Credit': 70774.0, 'Balance': 2313432} and df2 - {'Date': '02-Apr-2021', 'Particulars': 'New_Payment', 'Company Name': 'XYZ Company', 'Type': 'Payment', 'Debit': 312, 'Credit': 0, 'Balance': 34332} — rishab ajain445, Sep 10 '22 at 08:24
There must be something missing, this should be a nested dictionary — mozway, Sep 10 '22 at 08:26
yea my bad-please find the correct output here - https://controlc.com/d3940b20 — rishab ajain445, Sep 10 '22 at 08:41
@rishabajain445 Please [edit](https://stackoverflow.com/posts/73670266/edit) your post. There is no need for an external link. In future, please do not post images of data or code. This is hard to copy'n'past. — mosc9575, Sep 10 '22 at 08:58
As said above, edit your post and provide the shown dataframes, not the unrelated ones. The question is unclear at the moment. — mozway, Sep 10 '22 at 09:15
I've edited the post, please let me know if I have to make any more changes. @mosc9575 — rishab ajain445, Sep 10 '22 at 09:35
@mozway I've edited the post and pasted the related dataframes. — rishab ajain445, Sep 10 '22 at 09:42

score 1 · Accepted Answer · answered Sep 10 '22 at 10:40

If your requirement was just to find out unmatching entries, then the solution was very easy (thanks to Pandas Merging 101):

df1[["debit"]].merge(df2[["debit"]], on=["debit"], how="outer", indicator=True) \
     .query('_merge != "both"') \
     .drop("_merge", 1)

But you've the requirement to match repeated entries and report unmatched repeated entries, too. So, you need to adopt some work around. There are many ways to achieve this. Taking reference from my answer to a previous post, I'm proposing the following way:

PS - I've shown the processing for "debit" column. You can repeat it for "credit" column. As I understand from the question, processing of these two columns is not related.

Assume dataset:

df1 = pd.DataFrame(data=[(34, 56), (34, 21), (34, ), (453, 22)], columns=["debit", "credit"])
df2 = pd.DataFrame(data=[(34, 56), (453, 21), (453, 23)], columns=["debit", "credit"])

Expected output: Two unmached 34 and one 453 (under "debit" column).

Group by debit to get count of repeated values:

df1_grp = df1.groupby("debit").agg(cnt=("debit", len))
>>        cnt
>> debit     
>> 34       3
>> 453      1

df2_grp = df2.groupby("debit").agg(cnt=("debit", len))
>>        cnt
>> debit     
>> 34       1
>> 453      2

Create serial numbers for each repeated value:

df1_grp["serial_no"] = df1_grp.apply(lambda row: [i for i in range(1, row["cnt"]+1)], axis=1)
>>        cnt  serial_no
>> debit                
>> 34       3  [1, 2, 3]
>> 453      1        [1]

df2_grp["serial_no"] = df2_grp.apply(lambda row: [i for i in range(1, row["cnt"]+1)], axis=1)
>>        cnt serial_no
>> debit               
>> 34       1       [1]
>> 453      2    [1, 2]

Move serial numbers to individual rows:

df1_expl = df1_grp[["serial_no"]].explode("serial_no").dropna().reset_index()
>>    debit serial_no
>> 0     34         1
>> 1     34         2
>> 2     34         3
>> 3    453         1

df2_expl = df2_grp[["serial_no"]].explode("serial_no").dropna().reset_index()
>>    debit serial_no
>> 0     34         1
>> 1    453         1
>> 2    453         2

Result: Join the two dataframes by debit value and serial number. The result is the unmatched entries, including repeated entries:

df_result = df1_expl.merge(df2_expl, on=["debit", "serial_no"], how="outer", indicator=True) \
     .query('_merge != "both"') \
     .drop("_merge", 1)

>>    debit serial_no
>> 1     34         2
>> 2     34         3
>> 4    453         2

this is amazing. But what if I only want the unmatched entries and not the repeated entries in the final result? — rishab ajain445, Sep 10 '22 at 11:37
The result gives you only the unmatched entries, irrespective of repeated/unrepeated. — Azhar Khan, Sep 10 '22 at 11:59
Right. Just figured it out. Thank you so much. This definitely helped a lot. Would love to get to your level of expertise in pandas someday :) — rishab ajain445, Sep 10 '22 at 12:04

I've been trying to compare 2 columns with similar entries from 2 different excel files

1 Answers1