Pandas join on columns with different names

Question

I have two DataFrames that I want to join SQL-style. Unfortunately, as is often the case with the data I'm working with, the column names are spelled differently in each.

Below is an example of what I thought the syntax would look like, where userid belongs to df1 and username belongs to df2. Can anyone help me out?

 # not working - I assume some syntax issue?
pd.merge(df1, df2, on = [['userid'=='username', 'column1']], how = 'left')

score 44 · Accepted Answer · answered Nov 13 '16 at 03:32


When the names are different, use the left_on and right_on parameters instead of on=:

pd.merge(df1, df2,
         left_on=['userid', 'column1'],
         right_on=['username', 'column1'],
         how='left')
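To make this concrete, here is a minimal, self-contained sketch. The sample values and the score and city columns are made up for illustration; only the key column names (userid, username, column1) come from the question.

```python
import pandas as pd

# Hypothetical sample data: only the key column names come from the question.
df1 = pd.DataFrame({'userid': ['a', 'b', 'c'],
                    'column1': [1, 2, 3],
                    'score': [10, 20, 30]})
df2 = pd.DataFrame({'username': ['a', 'b', 'd'],
                    'column1': [1, 2, 4],
                    'city': ['X', 'Y', 'Z']})

# Pair userid with username and column1 with column1, position by position.
merged = pd.merge(df1, df2,
                  left_on=['userid', 'column1'],
                  right_on=['username', 'column1'],
                  how='left')

# Both differently named key columns survive the merge; drop the
# right-hand one if it is not needed.
tidy = merged.drop(columns='username')
print(tidy)
```

The row for userid 'c' has no partner in df2, so its city value comes back as NaN, as expected for a left join.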


Zeugma


aichao · Answer 2 · 2018-09-18T16:26:17.983

An alternative approach is to use join, setting the index of the right-hand side DataFrame to the columns ['username', 'column1']:

df1.join(df2.set_index(['username', 'column1']), on=['userid', 'column1'], how='left')

The output of this join merges the matched keys from the two differently named key columns, userid and username, into a single column named after the key column of df1, userid; whereas the output of the merge maintains the two as separate columns. To illustrate, consider the following example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1,2,3,4,5,6], 'pID' : [21,22,23,24,25,26], 'Values' : [435,33,45,np.nan,np.nan,12]})
##    ID  Values  pID
## 0   1   435.0   21
## 1   2    33.0   22
## 2   3    45.0   23
## 3   4     NaN   24
## 4   5     NaN   25
## 5   6    12.0   26

df2 = pd.DataFrame({'ID' : [4,4,5], 'pid' : [24,25,25], 'Values' : [544, 545, 676]})
##    ID  Values  pid
## 0   4     544   24
## 1   4     545   25
## 2   5     676   25

pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
##    ID  Values_x  pID  Values_y   pid
## 0   1     435.0   21       NaN   NaN
## 1   2      33.0   22       NaN   NaN
## 2   3      45.0   23       NaN   NaN
## 3   4       NaN   24     544.0  24.0
## 4   5       NaN   25     676.0  25.0
## 5   6      12.0   26       NaN   NaN

df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
##    ID  Values_x  pID  Values_y
## 0   1     435.0   21       NaN
## 1   2      33.0   22       NaN
## 2   3      45.0   23       NaN
## 3   4       NaN   24     544.0
## 4   5       NaN   25     676.0
## 5   6      12.0   26       NaN

Here, we also need to specify lsuffix and rsuffix in join to distinguish the overlapping column Values in the output. As one can see, the output of merge contains the extra pid column from the right-hand side DataFrame, which IMHO is unnecessary given the context of the merge. Note also that the dtype of the pid column has changed to float64, which results from upcasting due to the NaNs introduced by the unmatched rows.

This tidier output comes at a cost in performance, as the call to set_index on the right-hand side DataFrame incurs some overhead. However, a quick and dirty profile shows that the cost is not too horrible, roughly 30% slower, which may be worth it:
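If the extra pid column is the main objection to merge, another option is to rename the right-hand key so that both frames share the key name and then merge with on=. This is not part of the comparison above, just a sketch using the same example frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'pID': [21, 22, 23, 24, 25, 26],
                    'Values': [435, 33, 45, np.nan, np.nan, 12]})
df2 = pd.DataFrame({'ID': [4, 4, 5],
                    'pid': [24, 25, 25],
                    'Values': [544, 545, 676]})

# Rename pid -> pID on a copy of df2 so the key names match, then use on=.
# The result carries a single pID column; the overlapping Values column
# gets merge's default _x / _y suffixes.
renamed = pd.merge(df1, df2.rename(columns={'pid': 'pID'}),
                   how='left', on=['ID', 'pID'])
print(renamed)
```

rename returns a new DataFrame here, so df2 itself is left untouched.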

sz = 1000000 # one million rows
df1 = pd.DataFrame({'ID': np.arange(sz), 'pID' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz/2),np.arange(sz/2)]), 'pid' : np.arange(0,2*sz,2), 'Values' : np.random.random(sz)})

%timeit pd.merge(df1, df2, how='left', left_on=['ID', 'pID'], right_on=['ID', 'pid'])
## 818 ms ± 33.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df1.join(df2.set_index(['ID','pid']), how='left', on=['ID','pID'], lsuffix='_x', rsuffix='_y')
## 1.04 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
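%timeit is an IPython magic, so the lines above only run in IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard timeit module. The via_merge and via_join helper names are made up for this sketch, the frame size is reduced to keep the run short, and ID uses integer division so the keys stay integers. Timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

sz = 100_000  # smaller than the one million rows above, to keep the run short
rng = np.random.default_rng(0)

df1 = pd.DataFrame({'ID': np.arange(sz),
                    'pID': np.arange(0, 2 * sz, 2),
                    'Values': rng.random(sz)})
df2 = pd.DataFrame({'ID': np.concatenate([np.arange(sz // 2), np.arange(sz // 2)]),
                    'pid': np.arange(0, 2 * sz, 2),
                    'Values': rng.random(sz)})

def via_merge():
    # Hypothetical helper wrapping the merge call from the answer.
    return pd.merge(df1, df2, how='left',
                    left_on=['ID', 'pID'], right_on=['ID', 'pid'])

def via_join():
    # Hypothetical helper wrapping the join call; set_index is timed too.
    return df1.join(df2.set_index(['ID', 'pid']), how='left',
                    on=['ID', 'pID'], lsuffix='_x', rsuffix='_y')

# Take the best of three single runs for each approach.
merge_time = min(timeit.repeat(via_merge, number=1, repeat=3))
join_time = min(timeit.repeat(via_join, number=1, repeat=3))
print(f"merge: {merge_time:.3f} s, join: {join_time:.3f} s")
```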
