Vlookup in Pandas - Join or Merge?

Question

I am trying to replicate a Vlookup Excel function using pandas. I have used Join and Merge and both methods give me wrong results.

Df1 has 15 columns with integers and text values and Df2 has 6 columns with mostly text.

I am trying to bring User details from Df2 into Df1 using the column label 'Created By'.

Df1 looks like this:

 CA#   CreatedBy  $
9xxx12  User 1      10
9xxx13  User 2      20
9xxx14  User 3      25

Df2 looks like this:

CreatedBy     Role
User 1         Sales
User 2         Maintenance
User 3         Operations

My expected results would be: DfMerged

CA#     CreatedBy  $   User Role
9xxx12  User 1      10  Sales
9xxx13  User 2      20  Maintenance
9xxx14  User 3      25  Operations

I tried the following code variations, but they don't match all user IDs leaving some blanks in Df1 when there is data in Df2 that matches.

   merged= data_fr1.merge(data_fr2, on=['Created By'], how='left')

   merged2= pd.merge(data_fr1, data_fr2, left_on='Created By', 
   right_on='Created By', how='left')

Someone pointed to this post for an answer: Pandas Merging 101

But I'm still not getting the right results. The 'CreatedBy' field is not populating for all users in Df1. This field is a mix of text and numbers, eg: User1, User2, etc. I wonder if the datatype is interfering with the results.

I read the post and tried some of the solutions, the closest one to my problem involves using pd.merge(df2, on='key', how='left') but I still see errors. — Leo, Jun 21 '19 at 19:11

score 0 · Accepted Answer · answered Jun 21 '19 at 17:59

0

Does this not get you what you want doing merge? I'm unsure why you have the null column for role and everything under user but you can rename columns.

print('df')
print(df)
print('df2')
print(df2)
print('out_df')
print(out_df)

df.merge(df2[['By', 'Role']], on='By')
df
      CA# Created  By   $
0  9xxx12    User   1  10
1  9xxx13    User   2  20
2  9xxx14    User   3  25
df2
  Created  By         Role
0    User   1        Sales
1    User   2  Maintenance
2    User   3   Operations
out_df
      CA# Created  By   $         User  Role
0  9xxx12    User   1  10        Sales   NaN
1  9xxx13    User   2  20  Maintenance   NaN
2  9xxx14    User   3  25   Operations   NaN
Out[40]: 
      CA# Created  By   $         Role
0  9xxx12    User   1  10        Sales
1  9xxx13    User   2  20  Maintenance
2  9xxx14    User   3  25   Operations

Edit: Sorry, some of the issue is the clipboard parsing. The logic applies. If you're still having issues can you provide examples of "lines" that are not joining properly?

answered Jun 21 '19 at 17:59

krewsayder

446
4
9

I tried the code, but it replicated the records. My Df1 has about 3,207 records and the file that resulted from the code you shared gave me an output of over 52k records. – Leo Jun 21 '19 at 18:23
So in excel, vlookup takes the first match. In pandas you're joining 2 tables. What you have is a Cartesian join. This is where you're going to multiply results because you have many instances or records of the same user. So if you have 2 "User 1" and you merge you'll get 4 records instead of 2 because it's 2*2, not 2. The fix is adding another column to join on if you can to attempt to differentiate, adding a condition, or joining on a unique value if you can. – krewsayder Jun 21 '19 at 18:26
When I use this code: merged= data_fr1.merge(data_fr2, on=['Created By'], how='left') the records don't replicate but there are some CreatedBy users' ids that don't populate. The ones that do populate are correct. The CreatedBy key is a combination of string and text, eg: User1, User2, etc. I wonder if the data type is creating a problem. – Leo Jun 21 '19 at 19:06
The reason they're not populating is because the "right" side doesn't exist in the left side. Try how = 'outer' and you'll get all results from both "sides" of the join. If you're worried about type, try converting the whole column first with df['colname'] = df['colname'].apply(str) – krewsayder Jun 22 '19 at 19:26
Update: I didn't realize that the User ID on one of the dataframes was capitalized and it wasn't on the second dataframe. I used a capitalize function to homogenize the data and it worked. Thanks! – Leo Jun 24 '19 at 23:42

Vlookup in Pandas - Join or Merge?

1 Answers1