I am running regressions by group, and for each group I return the coefficients and the entire vector of residuals. The result is a tuple per group whose two elements have different sizes.

Currently I am having trouble unpacking it into two separate, workable data frames:

import numpy as np
import pandas as pd
import statsmodels.api as sm

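# simple regression
def regress(data, yvar, xvars):
    Y = data[yvar]
    # copy so that adding the intercept does not write into a slice of `data`
    X = data[xvars].copy()
    X['intercept'] = 1.
    result = sm.OLS(Y, X, missing='drop').fit()
    # coefficient Series and residual Series for this group
    return result.params, result.resid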

# regression by group
def regress_groupby(data, yvar, xvars, groupby):
    df = data
    result = df.groupby(groupby).apply(regress, yvar, xvars)
    return result

# simulate data
N = 5
T = 20

df = pd.DataFrame(pd.Series(range(0, N)), columns = ['Id'])
df = df.reindex(np.repeat(df.index, T)).reset_index(drop=True)

df['x'] = np.random.normal(0, 1, len(df.index))
# true model: intercept 0, slope 1, standard normal noise
df['y'] = 0 + 1*df['x'] + np.random.normal(0, 1, len(df.index))
                 
fit = regress_groupby(data=df, yvar='y', xvars=['x'], groupby='Id')
print(fit)

What I would like is one N x 2 dataframe with the intercept in the first column and the coefficient estimate in the second, and one (N*T) x 1 dataframe with the residuals (so the same number of rows as the original dataframe). It would be nice if the 'Id' were carried along as well.

I could of course just output the coefficient estimates, merge them back into the dataframe, and calculate the residuals myself. However, that approach seems less flexible when there are multiple regressors. I found this, but it didn't help me.


1 Answer

Your fit object is a Series of tuples, with Id as the index. Each tuple holds two Series: the coefficients (result.params) and the residuals (result.resid).

You can first use

coefs, resids = zip(*fit)

to split the tuples and put the coefficients/intercepts into one tuple and residuals into another.
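To see what the unpacking does, here is a minimal sketch on made-up data. The list `pairs` is a hypothetical stand-in for the per-group results (it is not the real `fit` object), and the numbers are invented for illustration. Iterating over a Series yields its values, so the same idea should carry over to `zip(*fit)`.

```python
import pandas as pd

# Hypothetical stand-in for the per-group results: one (params, resid)
# tuple per group. The numbers are made up for illustration.
pairs = [
    (pd.Series({"x": 1.0, "intercept": 0.1}), pd.Series([0.5, -0.5])),
    (pd.Series({"x": 2.0, "intercept": 0.2}), pd.Series([0.3, -0.3])),
]

# zip(*pairs) transposes the list: all first items end up in one tuple,
# all second items in another.
coefs, resids = zip(*pairs)

print(type(coefs).__name__, len(coefs))    # a tuple with one params Series per group
print(type(resids).__name__, len(resids))  # a tuple with one resid Series per group
```

Both results are plain tuples in group order, which is why the group labels have to be supplied again (via `fit.index`) when building the dataframes.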

Then, to get the coefficients/intercept values into a dataframe, you can use:

coefs_df = pd.DataFrame(coefs, index=fit.index)
print(coefs_df)
#            x  intercept
# Id
# 0   1.204225  -0.468649
# 1   0.906064  -0.015549
# 2   1.208573   0.011745
# 3   1.070190   0.335113
# 4   0.756508  -0.351270

To create a dataframe with residuals, use:

resids_df = pd.DataFrame(zip(fit.index, resids), columns=['Id', 'Resids'])
resids_df = resids_df.set_index('Id')['Resids'].explode().reset_index()
print(resids_df.head())
#    Id    Resids
# 0   0  0.129417
# 1   0 -1.453258
# 2   0  0.398382
# 3   0  1.546869
# 4   0  2.002856
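As an alternative sketch (not part of the approach above), `pd.concat` with a dict keyed by Id can build both frames. As far as I can tell it keeps the residuals as floats (`explode` tends to leave an object column) and preserves the original row labels, which makes joining back onto the source dataframe easier. The dicts `coefs` and `resids` below are hypothetical stand-ins with made-up numbers, and the level names `term` and `row` are my own choice.

```python
import pandas as pd

# Hypothetical per-group outputs keyed by Id (made-up numbers); in the
# question these would be the params and resid Series from each fit.
coefs = {
    0: pd.Series({"x": 1.0, "intercept": 0.1}),
    1: pd.Series({"x": 2.0, "intercept": 0.2}),
}
resids = {
    0: pd.Series([0.5, -0.5], index=[0, 1]),  # index = original row labels
    1: pd.Series([0.3, -0.3], index=[2, 3]),
}

# Stack the coefficient Series under an (Id, term) MultiIndex, pivot the
# terms into columns, then put the intercept first as the question asks.
coefs_df = pd.concat(coefs, names=["Id", "term"]).unstack("term")
coefs_df = coefs_df[["intercept", "x"]]

# Stack the residual Series under an (Id, row) MultiIndex, then move Id
# into a column so the index is the original row label again.
resids_df = (
    pd.concat(resids, names=["Id", "row"])
    .rename("Resids")
    .reset_index(level="Id")
)

print(coefs_df)
print(resids_df)
```

With the tuples from the unpacking step, the dicts could be built as `dict(zip(fit.index, coefs))` and `dict(zip(fit.index, resids))`. Selecting the columns by name also extends naturally to several regressors.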