1

I want to do a special merge, but I'm not sure how to do it in pandas.

I have the following df1:

    CompanyName      Country Ticker  Revenue ....................
0   Apple Inc.       US      
1   Microsoft        US      MSFT    1235        
2   Sony             US 
3   DBS              SG      D05     5523 
4   Razer            HK      0700.HK 2231
5   General Electric US      GE      2131
6   Check Point      L 
7   Huawei           CN  

And I have this frame tickersdf:

    CompanyName       Ticker    Country
0   Apple Inc.        AAPL.MX   MX
1.  Apple Inc.        APC.DE    DE
2.  Apple Inc.        APC.F     F
3.  Microsoft         MSFT      US
4.  Sony              SNE       US
5   Razer             0700.HK   HK
6.  General Electric  GE        US
7.  Sony              6758.T    T
8.  Microsoft         MSFT.BA   BA
9   General Motors.   GM        US
10. Check Point       CHKP      US

Firstly, I want to merge on the CompanyName and Country as the keys.

This I would do with a simple command like,

df1 = pd.merge(df1,tickersdf,on=['CompanyName','Country'], how='left')

So Sony which is missing SNE will be filled up first.

    CompanyName      Country Ticker  Revenue ....................
0   Apple Inc.       US      
1   Microsoft        US      MSFT    1235        
2   Sony             US      SNE
3   DBS              SG      D05     5523 
4   Razer            HK      0700.HK 2231
5   General Electric US      GE      2131
6   Check Point      L   
7   Huawei           CN

As you can see, df1 has missing values and I want to get the values from tickersdf.

However, tickersdf has multiple ticker values for the same CompanyName.

If you look at Apple Inc., it doesn't have the US ticker available, but it has alternatives.

I would like to merge the following data frames like so that I get the following:

    CompanyName      Country Ticker    Revenue  ....................
0.  Apple Inc.       US      hasalt
1   Apple Inc.       MX      AAPL.MX    
2.  Apple Inc.       DE      APC.DE
3.  Apple Inc.       F       APC.F
4   Microsoft        US      MSFT      1235
5.  Microsoft        BA      MSFT.BA
6   Sony             US      SNE
7.  Sony             T       6758.T
8   DBS              SG      D05       5523
9   Razer            HK      0700.HK   2231
10  General Electric US      GE        2131
11. Check Point      L       hasalt
12. Check Point      US      CHKP
13. Huawei           CN      

I don't want to touch the rows that contain a Ticker already in df1.

For companies like Apple Inc. and Check Point in this example, the ticker isn't available in US or L, but, it has alternatives. So I want the ticker portion to say 'hasalt' or be left blank and I want the alternatives to be copied over into df1.

Microsoft and Sony already have MSFT and SNE filled in df1, however there are alternatives available, so I want those to be checked and copied over to df1 as well.

If a company name doesn't exist in df1, I don't want it to be added on, like General Motors wasn't in df1, so I don't want it to be brought over.

Companies like Huawei have no matches at all in the tickersdf, so I want it to be left blank.

Can this be done in pandas? If so, how?

I was thinking of doing this

df1['Ticker'] = df1['Ticker'].fillna(df1['CompanyName'].map(tickersdf.set_index('CompanyName')['Ticker']))

But tickersdf has duplicates, then I also considered merging or joining, but I don't think it can be done that way because the entries that are in tickersdf that are not in df1 will appear in df1.

I know I can filter out the alternatives using this

tickersdf[(tickersdf['CompanyName'].str.contains('Check Point'))]

But how do I copy the rows over into df1 based on my condition?

I think it would look something like,

if df1['CompanyName'] is in tickersdf['CompanyName'], 

then set df1['Ticker'] = 'hasalt' 

followed by tickersdf[(tickersdf['CompanyName'].str.contains(df1['CompanyName']))] 

and copying all the alternatives over to df1. 

else ignore

What's the right way to do this? Could it be done with some kind of special ffill?

anarchy
  • 3,709
  • 2
  • 16
  • 48

2 Answers2

1

I think what you need is an outer merge of the two dataframes (see also this excellent post about merging)

df2 = pd.merge(df1,tickersdf,on=['CompanyName','Country'], how='outer', sort=True)

followed by some post-processing for meeting your conditions - first drop the companies that are not present in df1:

df2 = df2.drop(df2[~df2.CompanyName.isin(df1.CompanyName.values)].index).reset_index(drop=True)

Note that I have assigned the result of the merge to another dataframe df2, so you can still access the original dataframe df1 in this step.

Finally, create a new Ticker column based on Ticker_x and Ticker_y:

df2['Ticker'] = df2.apply (lambda r: r.Ticker_y if pd.isnull(r.Ticker_x) or r.Ticker_x == '' else r.Ticker_x, axis=1)
Gerd
  • 2,568
  • 1
  • 7
  • 20
1

All the work is made by an outer join... the rest is only filling and rearrange data following your logic

df1 = pd.DataFrame({'CompanyName':['Apple Inc.','Microsoft','Sony','DBS','Razer','General Electric','Check Point','Huawei'],
                    'Country':['US','US','US','SG','HK','US','L','CN'],
                    'Ticker':['','MSFT','','D05','0700.HK','GE','',''],
                    'Revenue':[np.nan,1235,np.nan,5523,2231,2131,np.nan,np.nan]})
tickersdf = pd.DataFrame({'CompanyName':['Apple Inc.','Apple Inc.','Apple Inc.','Microsoft','Sony','Razer','General Electric','Sony',
                                         'Microsoft','General Motors.','Check Point'],
                          'Ticker':['AAPL.MX','APC.DE','APC.F','MSFT','SNE','0700.HK','GE','6758.T','MSFT.BA','GM','CHKP'],
                          'Country':['MX','DE','F','US','US','HK','US','T','BA','US','US']})

df = pd.merge(df1,tickersdf,on=['CompanyName','Country'], how='outer')
df.loc[(df.CompanyName.isin(tickersdf.CompanyName))&df.Ticker_y.isna(), 'Ticker_y'] = 'hasalt'
df.loc[~(df.CompanyName.isin(tickersdf.CompanyName)),'Ticker_y'] = df.loc[~(df.CompanyName.isin(tickersdf.CompanyName))].Ticker_x

# rearrange data
df = df[df.CompanyName.isin(df1.CompanyName)].sort_values('CompanyName').reset_index(drop=True)
df.drop('Ticker_x', axis=1, inplace=True)
df = df.rename(columns={'Ticker_y':'Ticker'})

enter image description here

Marco Cerliani
  • 21,233
  • 3
  • 49
  • 54