2

I am making a small mistake and I'm not sure how to merge two df correctly. I want to merge on IBES_cusip to get gvkey into df1.

I try the following, but it just expands the dataset out and does not match correctly:

df1 = df1.merge(df2, how = 'left', on =['IBES_cusip'])

df1

       IBES_cusip       pends pdicity  ... ltg_eps  futepsgrowth 
0        00036110  1983-05-31     ANN  ...     NaN           NaN  
1        00036110  1983-05-31     ANN  ...     NaN           NaN  
2        00036110  1983-05-31     ANN  ...     NaN           NaN  
3        98970110  1983-05-31     ANN  ...     NaN           NaN  
4        98970110  1983-05-31     ANN  ...     NaN           NaN  
          ...         ...     ...  ...     ...           ...      
373472   98970111  2018-12-31     ANN  ...   10.00      0.381119  
373473   98970111  2018-12-31 

df2

         gvkey IBES_cusip
0       024538   86037010
1       004678   33791510
2       066367   26357810
3       137024   06985P20
4       137024   06985P20
       ...        ...
833796  028955   33975610
833797  061676   17737610
833798  011096   92035510
833799  005774   44448210
833800  008286   69489010
oceanbeach96
  • 604
  • 9
  • 19

1 Answers1

2

Your main problem is that your df2 contains duplicate values in IBES_cusip column. from the sample you gave I can see that

3       137024   06985P20
4       137024   06985P20

are the same values, this would cause the to get unwanted results (duplicate rows in the output).
try this

df1 = df1.merge(df2.drop_duplicates(subset=['IBES_cusip']), how='left', on='IBES_cusip')

Which should technically just add a gvkey column to your df1.

This assumes that you are pretty sure that you don't have rows with the same IBES_cusip that are matched with different gvkey otherwise you need to figure that out first.

Jimmar
  • 4,194
  • 2
  • 28
  • 43