1

I´m trying to do a linear regression on the results of a dataframe groupby by date and aggregate the results on another dataframe. So far I´m using an intermediate Series:

The dataframe is similar to

marker    date         variable       identifier  value
EA    2007-01-01      0.33            55          123
EA    2007-01-01      0.73            56          1123
EA    2007-01-01      0.51            57          123
EA    2007-02-01      0.13            55          4446
EA    2007-02-01      0.23            57          667
EA    2007-03-01      0.82            55          5675
EA    2007-03-01      0.88            56          1
EB    2007-01-01      0.13            45          123
EB    2007-01-01      0.74            46          33234
EB    2007-01-01      0.56            47          111
EB    2007-02-01      0.93            45          42657
EB    2007-02-01      0.23            47          12321355
EB    2007-03-01      0.82            45          9897
EB    2007-03-01      0.38            46          786
EB    2007-03-01      0.19            47          993845

And the code snippet:

import statsmodels as sm
import pandas as pd



reg_results = pd.Series(name='reg_results')
mean_results = pd.Series(name='mean_results')
for date, group in df.groupby(df.index.date):
    formula = sm.formula.ols('value ~ variable', data=group).fit()
    reg_results.set_value(date.strftime("%Y-%m-%d"), formula.params['Intercept'] + formula.params['variable']*group['variable'])
    mean_results.set_value(date.strftime("%Y-%m-%d"), group.mean()['variable'])

final_df = pd.DataFrame()
final_df = pd.concat([reg_results, mean_results], axis=1)

There are other operations like a second groupby on the group and so on, so I get to create one series per operation that I want to create, and this gets very complicated very fast. Is there a way to do this on one step, or at least without the intermediate series?

Ivan
  • 19,560
  • 31
  • 97
  • 141

1 Answers1

1

Your code seems pretty much OK, except that


Here is something similar to your code (the computer I'm logged into right now doesn't have statsmodels installed)

res = df.groupby(df.date).apply(lambda g: pd.Series({'mean': g.variable.mean(), 'min': g.variable.min()}))

(Note that it's df.date in this example, by you it's a bit different.)

This is like your OLS in the sens that there's something to be done with each group, and multiple return values. In your case, you would use a function that takes a group and returns a series of the OLS results.


Now that we have that, it's simply

pd.concat([df, res], axis=1)

Note the axis=1 - it means to concat horizontally, which I believe is what you mean here.

Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185