2

I have tried to match two data frames on three columns, namely "ticker", "year", and "quarter". While the first dataset contains two observations (i.e. two rows) for each ticker, year, quarter, the second only contains one row... especially, the first dataframe always captures two different speakers in the same quarter while the second dataframe only captures firm information. Now, I want to match the dataframe so that each of the two rows of the first data frame within the same firm quarter are merged with the one relevant row in the second dataframe.

My data looks as follows: df

ticker  year    quarter exec_lname  jobposition
XX      2009    3       A           CEO
XX      2009    3       B           CFO
XX      2009    4       A           CEO
XX      2009    4       B           CFO
YY      2007    1       C           CEO
YY      2007    1       D           CFO
YY      2007    2       C           CEO
YY      2007    2       D           CFO
ZZ      2008    3       F           CEO
ZZ      2008    3       G           CFO

dfnew

ticker  year    quarter eps calldate
XX      2009    3       x   Mar
XX      2009    4       y   Apr
YY      2007    1       z   Feb
YY      2007    2       a   Jan
ZZ      2008    3       b   Dec

At the end, it should look like this:

ticker  year    quarter exec_lname  jobposition eps calldate
XX      2009    3       A           CEO         x   Mar
XX      2009    3       B           CFO         x   Mar
XX      2009    4       A           CEO         y   Apr
XX      2009    4       B           CFO         y   Apr
YY      2007    1       C           CEO         z   Feb
YY      2007    1       D           CFO         z   Feb
YY      2007    2       C           CEO         a   Jan
YY      2007    2       D           CFO         a   Jan
ZZ      2008    3       F           CEO         b   Dec
ZZ      2008    3       G           CFO         b   Dec

I tried:

dfjoin = pd.merge(dfnew, df,  how='left', left_on=['ticker', "year", "quarter"], right_on = ['ticker', "year", "quarter"])

but it returns the new dataset with all the right rows and columns, yet with the columns eps and calldate completely filled with NaN. Might this be because I want to merge each row twice to df? The problem is not merging on more than one key - the problem might be that in the first dataframe I always have two rows with the same combination of ticker/year/quarter.

I hope someone can help me! Thank you! Julia

Julia
  • 109
  • 2
  • 7
  • Just use df1.merge(df2, on = ['ticker', 'year', 'quarter']) – Vaishali Oct 18 '17 at 20:39
  • Possible duplicate of [Merge two tables based on multiple keys in Python pandas](https://stackoverflow.com/questions/32277473/merge-two-tables-based-on-multiple-keys-in-python-pandas) – Vaishali Oct 18 '17 at 20:39
  • That leaves me with all columns but 0 rows... – Julia Oct 18 '17 at 20:45
  • @Julia I just tried what you suggested in the question (`pd.merge(dfnew, df, how='left', left_on=['ticker', "year", "quarter"], right_on = ['ticker', "year", "quarter"])`) and it worked perfectly with your example data, no `NaN`s were introduced. Try reading in the data again and restart your Python session. – joelostblom Oct 18 '17 at 21:38

2 Answers2

0

Use merge with default parameter how = 'inner'

df.merge(dfnew, on = ['ticker', 'year', 'quarter'])

You get

    ticker  year    quarter exec_lname  jobposition eps calldate
0   XX      2009    3       A           CEO         x   Mar
1   XX      2009    3       B           CFO         x   Mar
2   XX      2009    4       A           CEO         y   Apr
3   XX      2009    4       B           CFO         y   Apr
4   YY      2007    1       C           CEO         z   Feb
5   YY      2007    1       D           CFO         z   Feb
6   YY      2007    2       C           CEO         a   Jan
7   YY      2007    2       D           CFO         a   Jan
8   ZZ      2008    3       F           CEO         b   Dec
9   ZZ      2008    3       G           CFO         b   Dec
Vaishali
  • 37,545
  • 5
  • 58
  • 86
  • When I do that I am left with 0 rows and all columns... if I do outer, then all rows are added twice, one with only info ticker, year, quarter, exec_lname, jobposition and then once with ticker, year, quarter, eps, calldate... with inner I get no row at all. – Julia Oct 18 '17 at 21:22
  • Hmm, its difficult to help when we are not able to reproduce the error. Its working fine on the data you provided – Vaishali Oct 19 '17 at 00:01
0

Just merge both data frames and specify the variable names in "by" section.

merge(df,dfnew,by=c("ticker", "year", "quarter"), all.x=TRUE)

Sivajee Battina
  • 4,124
  • 2
  • 22
  • 45
Miguel
  • 1
  • 1