I have a dataframe of financial data (monthly returns) for a relative large number of firms (~25k, each represented by a PERMNO) from 1926 until today.
dependent variables: dataframe of monthly return
My goal is to regress these on three independent variables, namely the market return (Mkt-RF) and two factor portfolio returns (SMB & HML), using a window of 60 months of returns for each regression.
As far as I understand, I can use statsmodel's RollingOLS
to achieve this for each firm. However this would imply looping over all 25k firms, which seems not to be very efficient. For example, running the regression for one firm takes ~250 ms, i.e. for all 25k this would be > 1.5h. Actually, the time would still increase because beyond calculating the coefficients I also need to do further operations, e.g. on the regression residuals.
My first idea was matching the independent variables to the dependent variables, to construct the following df:
Then using something like
df.rolling(window=60, min_periods=24).apply(lambda x: myreg(x['RET'], x[['Mkt-RF', 'SMB', 'HML']])
and matrix algebra to calculate the regression coefficient
def myreg(x, y):
Y = y.values.reshape(-1,1)
X = np.concatenate((np.ones(shape=Y.shape), x.values), 1)
coeffs = inv(X.transpose().dot(X)).dot(X.transpose()).dot(Y)
return coeffs
I am aware of at least two problems, why this does not work:
- The
.rolling
does not "start from 0" for a new firm, i.e. would mix up data from different firms in the regression. rolling().apply()
works on each of the input dataframes' columns seperately (see here), i.e. I did not find a way to apply a function which calculates the regression coefficients.
Although I'm pretty sure I'm not the first one with this problem I was not able to find a helpful answer in previous discussions on similar topics.
I assume there is a solution for this without inefficient/costly looping and saving intermediate results, but I have no idea how to do it. Any suggestions on how to achieve my goal of efficient calculation of the regression coefficients or an idea how to work around above mentioned issues?