Merge different length dfs and preserve all values from "master" df

Question

Working on what I think should be a simple merge but I can't find quite the right solution.

I have two dfs of Fortune 500 companies. df1 is 2 columns (Company and CIK), 117 rows long. df2 is 2 columns (Rank, and Company) and 225 rows long. The company order is different between the dfs.

I want a dataframe that matches the company names (even if the company names aren't spelled 100% the same), preserves the order of df2, and fills in the CIK values from df1 to df2, and fills in Na in the remaining values (108 will be NA).

This merge only gives 52 values (probably the ones that are exactly matching).

df3 = df2.merge(df1, on="Company")

This merge fills a lot of NaN values where things don't exactly match:

df3 = df2.merge(df1, on="Company", how = "outer")

df1:

{'Company': {0: 'Amazon',
  1: 'Verizon Communications',
  2: 'AT&T',
  3: 'Alphabet',
  4: 'Intel',
  5: 'Facebook',
  6: 'Exxon Mobil',
  7: 'Microsoft',
  8: 'Duke Energy',
  9: 'Comcast',
  10: 'Exelon'},
 'CIK': {0: '0001018724',
  1: '0000732712',
  2: '0001018724',
  3: '0001652044',
  4: '0000050863',
  5: '0001326801',
  6: '0000034088',
  7: '0000789019',
  8: '0001326160',
  9: '0001166691',
  10: '0001109357'}}

df2:

{'Rank': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 9,
  9: 10,
  10: 11},
 'Company': {0: 'Walmart',
  1: 'Amazon',
  2: 'Apple',
  3: 'CVS Health',
  4: 'United Health Group',
  5: 'Berkshire Hathaway',
  6: 'McKesson',
  7: 'AmerisourceBergen',
  8: 'Alphabet',
  9: 'Exxon Mobil',
  10: 'AT&T'}}

I have also tried various .concat and .join but have had similar problems.

Would a loop be more appropriate?

Thank you!

UPDATE: using a left join, the CIK row fills in with NaN (picture attached). Also, should mention that the CIK column is dtype=str because dtype=int64 eliminates the leading zeros and I need the zeros for the API call.

Can you include your dataframe as copy-pastable code, not an image? You can use `df.to_dict()` to accomplish this. See also [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Nick ODell, Jan 25 '22 at 19:29
The attached df1 and df2 is not matched with your description. It seems to be exchanged. And df2 has 2 columns according to the attached images but you mentioned it has only 1 column, which one is correct? — Park, Jan 25 '22 at 20:14
Thank you all, *so much* for pushing me a bit on this. Turns out, with the df.to_dict() I found out that I had some funky unicode in df2 and was able to delete that and successfully merge as intended. — jshapi16, Jan 25 '22 at 22:58
Also thank you for giving me more tips on asking better questions. I will save that for future use! — jshapi16, Jan 25 '22 at 22:58
@jshapi16 If one of answers helped solve your problem, pls mark the answer as accepted, so that it can help other users know this question is solved by the answer. — Park, Feb 07 '22 at 14:06

Park · Answer 1 · 2022-01-25T20:27:11.200

What you want to do is LEFT JOIN. I made my own sample data considering the attached images.

You can refer this code and understand how it works. If all CIK values are NaN in your result, it might be because 'Company' values are not correctly matched between df1 and df2 by some reasons, such as empty spaces in the values.

import pandas as pd

df1 = pd.DataFrame({
    'Company': ['Amazon', 'Intel', 'Alphabet', 'Tech Data'],
    'CIK': ['0001018724', '000050863', '0001652044', '0000790703'],
})

df2 = pd.DataFrame({
    'Rank': [1, 2, 3, 4, 5],
    'Company': ['Walmart', 'Amazon', 'Apple', 'CVS Health', 'Tech Data'],
})

df3 = df2.merge(df1, on='Company', how='left')

print(df3)
#   Rank     Company         CIK
#0     1     Walmart         NaN
#1     2      Amazon  0001018724
#2     3       Apple         NaN
#3     4  CVS Health         NaN
#4     5   Tech Data  0000790703

There are several kinds of JOINs, such as LEFT JOIN, RIGHT JOIN, OUTER JOIN, INNER JOIN, and so on.

You can see various JOINs with example codes here:

Thanks. I just tried this, but the output gives me all the companies but only NaN for the CIK values. Any ideas why that might be? — jshapi16, Jan 25 '22 at 14:57
@jshapi16 Could you update your question with sample data, so that I can see your expected output? — Park, Jan 25 '22 at 18:57

score 0 · Accepted Answer · answered Jan 25 '22 at 23:03

Using:

df2.to_dict()

I saw there were extra characters in

df2['Company']

This code block deleted the additional characters, "\xa0":

df2['Company'] = df2["Company"].apply(lambda x: str(x).replace(u'\xa0', u''))

Then performed the left join:

df3 = df2.merge(df1, on="Company", how="left")

Output:

     Rank                   Company         CIK
0       1                   Walmart  0000104169
1       2                    Amazon  0001018724
2       3                     Apple  0000320193
3       4                CVS Health  0000064803
4       5       United Health Group         NaN

Thanks for all the help!

Merge different length dfs and preserve all values from "master" df

2 Answers2