Operate on groupby and concatenate results on a dataframe in pandas

Question

I´m trying to do a linear regression on the results of a dataframe groupby by date and aggregate the results on another dataframe. So far I´m using an intermediate Series:

The dataframe is similar to

marker    date         variable       identifier  value
EA    2007-01-01      0.33            55          123
EA    2007-01-01      0.73            56          1123
EA    2007-01-01      0.51            57          123
EA    2007-02-01      0.13            55          4446
EA    2007-02-01      0.23            57          667
EA    2007-03-01      0.82            55          5675
EA    2007-03-01      0.88            56          1
EB    2007-01-01      0.13            45          123
EB    2007-01-01      0.74            46          33234
EB    2007-01-01      0.56            47          111
EB    2007-02-01      0.93            45          42657
EB    2007-02-01      0.23            47          12321355
EB    2007-03-01      0.82            45          9897
EB    2007-03-01      0.38            46          786
EB    2007-03-01      0.19            47          993845

And the code snippet:

import statsmodels as sm
import pandas as pd



reg_results = pd.Series(name='reg_results')
mean_results = pd.Series(name='mean_results')
for date, group in df.groupby(df.index.date):
    formula = sm.formula.ols('value ~ variable', data=group).fit()
    reg_results.set_value(date.strftime("%Y-%m-%d"), formula.params['Intercept'] + formula.params['variable']*group['variable'])
    mean_results.set_value(date.strftime("%Y-%m-%d"), group.mean()['variable'])

final_df = pd.DataFrame()
final_df = pd.concat([reg_results, mean_results], axis=1)

There are other operations like a second groupby on the group and so on, so I get to create one series per operation that I want to create, and this gets very complicated very fast. Is there a way to do this on one step, or at least without the intermediate series?

score 1 · Accepted Answer · edited May 23 '17 at 11:51

Your code seems pretty much OK, except that

I'd suggest that you use groupby/apply returning a Series (see Returning Multiple Values From Apply).
The concat seems to be on the wrong axis

Here is something similar to your code (the computer I'm logged into right now doesn't have statsmodels installed)

res = df.groupby(df.date).apply(lambda g: pd.Series({'mean': g.variable.mean(), 'min': g.variable.min()}))

(Note that it's df.date in this example, by you it's a bit different.)

This is like your OLS in the sens that there's something to be done with each group, and multiple return values. In your case, you would use a function that takes a group and returns a series of the OLS results.

Now that we have that, it's simply

pd.concat([df, res], axis=1)

Note the axis=1 - it means to concat horizontally, which I believe is what you mean here.

It was a typo, the concat is on the axis=1. – Ivan Jun 23 '15 at 13:37 — Ivan, Jun 23 '15 at 13:37

Operate on groupby and concatenate results on a dataframe in pandas

1 Answers1