1
Source Target
Source Target

Two tables will be joined via composite keys ID1 and ID2 via pandas merge. We have a data testing tool that does the data analysis. It will perform the merge and filter whatever rows are not present on the left or right side of the merge into an external Missing In Source or Missing in Target table.

Missing in Target Missing in Source
Missing in Target Missing in Source

In this example in Source the second row with composite key ABC,345 is missing in Target. So that row will be filtered into Missing in Target. Similarly in Target the second row with composite key ABC,222 is missing in Source so it will be filtered into Missing in Source.

For the Missing ... tables the business wants to know "why exactly are they missing"--which composite key by row made the row missing? For example, for the row with ABC,345 in the Missing in Target above, ABC was present in both table rows but 345 wasn't. Therefore,ID2 with value 345 is the guilty key for this row.

I should also mention that the version of pandas we are using is 0.23 unfortunately.

GaryChin
  • 29
  • 6
  • 1
    Adding some `python`/`pandas` code that generate your 4 tables would be helpful to potential answerers. – rickhg12hs Sep 09 '22 at 22:23
  • 1
    Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – BeRT2me Sep 09 '22 at 22:50
  • [Why should I not upload images of code/data/errors when asking a question?](https://meta.stackoverflow.com/q/285551/3404097) [Why are images of text, code and mathematical expressions discouraged?](https://meta.stackexchange.com/q/320052/266284) – philipxy Sep 10 '22 at 01:05
  • What is the question? [ask] [Help] PS Where are you stuck? [mre] – philipxy Sep 10 '22 at 01:15
  • Please before considering posting: Pin down code issues via [mre]. Read the manual/reference & google any error message & many clear, concise & precise phrasings of your question/problem/goal, with & without your particular names/strings/numbers, 'site:stackoverflow.com' & tags; read many answers. Reflect research in posts. SO/SE search is poor & literal & unusual, read the help. Google re googling/searching, including Q&A at [meta] & [meta.se]. [How much research effort is expected?](https://meta.stackoverflow.com/q/261592/3404097) [ask] [Help] PS This post contains no such clear phrasing. – philipxy Sep 10 '22 at 01:23
  • Create a composite key from the attributes that weren't used as composite keys initially in the target and source tables and perform an inner join. Any matches will provide you with the initial composite keys that were different. – Michael Gardner Sep 10 '22 at 21:13

1 Answers1

1

I believe you want .compare:

import pandas as pd


data1 = {
    "ID1": ["XYZ", "ABC", "EFG", "IJK"],
    "ID2": [123, 345, 789, 111],
    "DATA1": ["car", "truck", "cat", "lion"],
    "DATA2": ["house", "sky", "mouse", "zion"]
}

data2 = {
    "ID1": ["XYZ", "ABC", "ZAB", "IJK"],
    "ID2": [123, 222, 789, 111],
    "DATA1": ["car", "truck", "cat", "lion"],
    "DATA2": ["house", "sky", "mouse", "zion"]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df = (
    df1
    .compare(df2, align_axis=0)
    .rename(index={"self": "df1", "other": "df2"}, level=-1)
    .fillna("")
    .reset_index()
    .drop("level_0", axis=1)
    .rename(columns={"level_1": "source"})
)
print(df)

  source  ID1    ID2
0    df1       345.0
1    df2       222.0
2    df1  EFG       
3    df2  ZAB  

 
Jason Baker
  • 3,170
  • 2
  • 12
  • 15
  • Thank you. It is a shame we are using version 0.23 still. I believe compare is only available from 1.1. Nonetheless, you have given me some ideas. – GaryChin Sep 10 '22 at 15:36