2

I have a dataframe of financial data (monthly returns) for a relative large number of firms (~25k, each represented by a PERMNO) from 1926 until today.

dependent variables: dataframe of monthly return

My goal is to regress these on three independent variables, namely the market return (Mkt-RF) and two factor portfolio returns (SMB & HML), using a window of 60 months of returns for each regression.

independent variables

As far as I understand, I can use statsmodel's RollingOLS to achieve this for each firm. However this would imply looping over all 25k firms, which seems not to be very efficient. For example, running the regression for one firm takes ~250 ms, i.e. for all 25k this would be > 1.5h. Actually, the time would still increase because beyond calculating the coefficients I also need to do further operations, e.g. on the regression residuals.

timing example

My first idea was matching the independent variables to the dependent variables, to construct the following df:

df of all firms

Then using something like

df.rolling(window=60, min_periods=24).apply(lambda x: myreg(x['RET'], x[['Mkt-RF', 'SMB', 'HML']])

and matrix algebra to calculate the regression coefficient

def myreg(x, y):
   Y = y.values.reshape(-1,1)
   X = np.concatenate((np.ones(shape=Y.shape), x.values), 1)
   coeffs = inv(X.transpose().dot(X)).dot(X.transpose()).dot(Y)

   return coeffs

I am aware of at least two problems, why this does not work:

  1. The .rolling does not "start from 0" for a new firm, i.e. would mix up data from different firms in the regression.
  2. rolling().apply() works on each of the input dataframes' columns seperately (see here), i.e. I did not find a way to apply a function which calculates the regression coefficients.

Although I'm pretty sure I'm not the first one with this problem I was not able to find a helpful answer in previous discussions on similar topics.

I assume there is a solution for this without inefficient/costly looping and saving intermediate results, but I have no idea how to do it. Any suggestions on how to achieve my goal of efficient calculation of the regression coefficients or an idea how to work around above mentioned issues?

simon_dmh
  • 21
  • 2
  • did you ever solve this? I'm trying to do something similar. I used something similar to this [link](https://stackoverflow.com/questions/19401078/efficient-columnwise-correlation-coefficient-calculation) for correlations in case that's helpful to you – fazistinho_ Mar 18 '21 at 10:48

0 Answers0