0

I have a daily return data of a number of stocks that looks like:

           Stock A  Stock B  Stock C  Market
date
1987-02-02  0.01    0.02      0.02     0.01
1987-02-03  0.02    0.03      0.02     0.02
1987-02-04  0.03    0.01      0.01     0.03
1987-02-05  0.04    0.03      0.05     0.04

I want to calculate 30 day regression of Stock A, B, C on market, but only at the end of each month, i.e. at 1987-02-28, 1987-03-31... And then save the regression results in two matrices (one for the constant term, and one for the coefficient):

           Stock A  Stock B  Stock C  
date
1987-02-28  const    const     const   
1987-03-31  const    const     const  
1987-04-30  const    const     const 

           Stock A  Stock B  Stock C 
date
1987-02-28  coeff    coeff     coeff 
1987-03-31  coeff    coeff     coeff 
1987-04-30  coeff    coeff     coeff 

So far what I did was to create an indicator for "end of month" and then loop over all rows and columns of the date:

loop over columns:
    loop over rows:
        if end of month is true:
            regress Stock on Market using past 30 days data
            save the result 

Given I have a lot of stocks (7000+) over 50 years, this way is very slow. I am wondering if anyone has worked on similar problems before and has faster way of implementing this? Any tip on how to improve the speed or efficiency would be greatly appreciated.

ian_chan
  • 355
  • 5
  • 15

1 Answers1

1

You could start with an approach like this instead

import pandas as pd
import numpy as np
from datetime import datetime
from pandas.tseries.offsets import MonthEnd

#sample Data
df = pd.DataFrame(index=pd.DatetimeIndex(freq='D',start=datetime(1990,1,1),end=datetime(1995,12,25)),data=np.random.randint(0,100,(2185,3)),columns=['Stock A','Stock B','Stock C'])

#Create a column that has the end of the month for each date
df['end'] = df.index + MonthEnd(1)

#Groupby the end of the month and apply your regression function
for group,data in df.groupby('end'):
    for row in data.columns #Sudo code loop over columns used per regression
        regressFunction()

This should eliminate wasted time looping over rows and columns and instead, just compute the regression on known indices. Also, it maybe beneficial to compute the regression in parallel to speed up computation.

DJK
  • 8,924
  • 4
  • 24
  • 40
  • can i ask what if i want to use past 3 months in addition to just one month, how would i change the code? in other words, for each group, i need data from "current group", "group - 1", and "group -2". thanks! – ian_chan Jan 30 '18 at 23:49