Joining two dfs based on different col names?

Question

I have two dataframes such as:

dfa:
Name  |   ID   |   Amount
Bob      V434      50.00
Jill     B333      22.11
Hank     B442      11.11

dfb:
Name | ID_First | ID_Second | ID_Third
Bob      V434      E333        B442
Karen    V434      E333        B442
Jill     V434      E333        B442
Hank     V434      E333        B442

I want to join dfa to dfb, but the ID in dfa only corresponds to one of the IDS in dfb.

Is there a way I can join dfa to dfb for ID in dfa so basically if it matches any of the ids in dfb then I can match amount from dfa?

Required output would just be:

Name | ID_First | ID_Second | ID_Third | Amount 
    Bob      V434      E333        B442     50.00
    Jill    V434      E333        B442      22.11
    Hank     V434      E333        B442     11.11

Basically join on Name that exists in both tables, but the ID that exists in dfa exists in dfb under only one of the ID_First, second or third columns so the amount that matches for the same name and same ID value but that ID value is only in one of the IDS for dfb.

Thanks

Where that `55.00` amount came from and why `Jill` is not listed on the desired output as well? — accdias, Dec 10 '19 at 09:37
Do we assume that the name is unique ? The IDs don't seem to be unique thus merging on them will be problematic. Also do you just want to drop the ID column from dfa in the final output? — Karan Shishoo, Dec 10 '19 at 09:38
yes the name is unique, the output is just an example and not with real exact data — Chris90, Dec 10 '19 at 09:45

score 0 · Answer 1 · answered Dec 10 '19 at 09:55

You could attempt a merge on all three, though not sure how efficient that would be. This wouldn't account work for when you have multiple matches across IDs, if such a thing is possible. The following might work;

new_df = pd.DataFrame()
for col in ['ID_First', 'ID_Second', 'ID_Third']:
  df = pd.merge(dfa, dfb, left_on='ID', right_on=col, how='left')
  new_df = df if new_df.empty else new_df.append(df)

I don't think you can have an 'OR' condition in pd.merge.

This is another possibility;

Python Pandas: How to merge based on an "OR" condition?

score 0 · Answer 2 · answered Dec 10 '19 at 10:04

You can do 3 inner joins with each of your id columns and concatenate them

df1 = pd.DataFrame([['Bob','V434',50.00],['Jill','E333',22.11],['Hank','B442',11.11]],
                   columns=['Name','ID','Amount'])

df2 = pd.DataFrame([['Bob','V434','E333','B442'],
                    ['Karen','V434','E333','B442'],
                    ['Jill','V434','E333','B442'],
                    ['Hank','V434','E333','B442']],
                   columns=['Name','ID_First','ID_Second','ID_Third'])

print(pd.concat([df1.merge(df2, left_on=['ID','Name'], right_on=['ID_First','Name']),
                 df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_Second', 'Name']),
                 df1.merge(df2, left_on=['ID', 'Name'], right_on=['ID_Third', 'Name'])])[['Name','ID','Amount']])

Output:

   Name    ID  Amount
0   Bob  V434   50.00
0  Jill  E333   22.11
0  Hank  B442   11.11

Improvising on @Ian's answer to get the desired output:

new_df = pd.DataFrame()
for col in ['ID_First', 'ID_Second', 'ID_Third']:
  df = pd.merge(df1, df2, left_on=['ID','Name'], right_on=[col,'Name'], how='inner')
  new_df = df if new_df.empty else new_df.append(df)

CypherX · Answer 3 · 2019-12-10T10:18:31.477

Solution

You can do this with a simple merge statement as follows.

pd.merge(dfa[['Name', 'Amount']], dfb, how='inner', on='Name')

Note: While merging dfa and dfb, the columns, dfa.ID and dfb.ID do not act like primary keys, neither are their values unique. The only thing that matters here is to inner join dfa and dfb using the "Name" column.

Output:

For Reproducibility

You may load the data and test the solution given above, using the following code-block

import numpy as np
import pandas as pd
from io import StringIO

# Example Data
dfa = """
Name  |   ID   |   Amount
Bob   |  V434  |   50.00
Jill  |  B333  |   22.11
Hank  |  B442  |   11.11
"""
dfb = """
Name  | ID_First | ID_Second | ID_Third
Bob   |  V434    | E333      | B442
Karen |  V434    | E333      | B442
Jill  |  V434    | E333      | B442
Hank  |  V434    | E333      | B442
"""

# Load Data and Clean up empty spaces 
# in headers and columns
dfa = pd.read_csv(StringIO(dfa), sep='|')
dfb = pd.read_csv(StringIO(dfb), sep='|')
dfa.columns = dfa.columns.str.strip()
dfb.columns = dfb.columns.str.strip()
for col in dfa.columns:
    if col=='Amount':
        dfa[col] = dfa[col].astype(str).str.strip().astype(float)
    else:    
        dfa[col] = dfa[col].str.strip()
for col in dfb.columns:
    dfb[col] = dfb[col].str.strip()

# merge dfa and dfb: Note that dfa.ID and dfb.ID do not act 
# like primary keys, neither are their values unique. 
# The only thing that matters here is to inner join dfa 
# and dfb using the "Name" column.  

pd.merge(dfa[['Name', 'Amount']], dfb, how='inner', on='Name')

Joining two dfs based on different col names?

3 Answers3

Solution

For Reproducibility