I'm trying to run a linear regression on each group of a dataframe grouped by date, and collect the results in another dataframe. So far I'm using one intermediate Series per result.
The dataframe is similar to:
marker date variable identifier value
EA 2007-01-01 0.33 55 123
EA 2007-01-01 0.73 56 1123
EA 2007-01-01 0.51 57 123
EA 2007-02-01 0.13 55 4446
EA 2007-02-01 0.23 57 667
EA 2007-03-01 0.82 55 5675
EA 2007-03-01 0.88 56 1
EB 2007-01-01 0.13 45 123
EB 2007-01-01 0.74 46 33234
EB 2007-01-01 0.56 47 111
EB 2007-02-01 0.93 45 42657
EB 2007-02-01 0.23 47 12321355
EB 2007-03-01 0.82 45 9897
EB 2007-03-01 0.38 46 786
EB 2007-03-01 0.19 47 993845
And the code snippet:
import pandas as pd
import statsmodels.formula.api as smf

reg_results = pd.Series(name='reg_results')
mean_results = pd.Series(name='mean_results')
# df is indexed by the date column (a DatetimeIndex)
for date, group in df.groupby(df.index.date):
    fit = smf.ols('value ~ variable', data=group).fit()
    key = date.strftime("%Y-%m-%d")
    # fitted values for the rows of this date
    # (note: Series.set_value was removed in pandas 1.0)
    reg_results.set_value(key, fit.params['Intercept'] + fit.params['variable'] * group['variable'])
    mean_results.set_value(key, group['variable'].mean())
final_df = pd.concat([reg_results, mean_results], axis=1)
There are other operations as well, such as a second groupby on each group, so I end up creating one Series per operation, and this gets complicated very fast. Is there a way to do this in one step, or at least without the intermediate Series?
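
To make the goal concrete, here is a rough sketch of the kind of one-step result I have in mind, not a confirmed solution. It assumes the date is a regular column rather than the index, uses only the EA rows of the sample, and reduces each group to scalars (intercept, slope, mean). The helper name summarize is made up for illustration, and np.polyfit with degree 1 stands in for the statsmodels OLS fit from the snippet above, only to keep the sketch small.

```python
import numpy as np
import pandas as pd

# Subset of the sample data above (marker EA only); "date" is a plain column here.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2007-01-01"] * 3 + ["2007-02-01"] * 2 + ["2007-03-01"] * 2
    ),
    "variable": [0.33, 0.73, 0.51, 0.13, 0.23, 0.82, 0.88],
    "value": [123, 1123, 123, 4446, 667, 5675, 1],
})


def summarize(group):
    """Hypothetical helper: return every per-date result as one labelled Series."""
    # Degree-1 least-squares fit of value on variable (stand-in for the OLS fit).
    slope, intercept = np.polyfit(
        group["variable"].to_numpy(), group["value"].to_numpy(), 1
    )
    return pd.Series({
        "intercept": intercept,
        "slope": slope,
        "mean_variable": group["variable"].mean(),
    })


# Each returned Series becomes one row, so no intermediate Series are needed.
final_df = df.groupby("date")[["variable", "value"]].apply(summarize)
print(final_df)
```

If something like this is reasonable, further per-group operations would just be extra entries in the returned Series instead of extra intermediate Series.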