
What I'm going to do

I'd like to compute the average stock price, the regression coefficient (slope), and the R-square of stock prices, as floats, by stock item (e.g. Apple, Amazon, etc.) and by a certain date period (e.g. Feb. 15 ~ Mar. 14), as part of a quantitative investment simulation spanning 30 years. The problem is that it is simply too slow. At first I wrote the whole thing in PostgreSQL, but it was too slow - it hadn't finished after 2 hours. After asking a professor friend in management information systems, I'm trying pandas for the first time.


The data structures implemented so far look like this:

  1. Raw data (Dataframe named dfStock)
    ──────────────────────────────────────────
    Code | Date | Date Group | Price |
    ──────────────────────────────────────────
    AAPL | 20200205 | 20200205 | ###.## |
    AAPL | 20200206 | 20200305 | ###.## |
    ...
    AAPL | 20200305 | 20200305 | ###.## |
    AAPL | 20200306 | 20200405 | ###.## |
    ...
    ──────────────────────────────────────────

  2. Results (Dataframe named dfSumS)
    ──────────────────────────────────────────
    Code | Date Group | Avg. Price | Slope | R-Square
    ──────────────────────────────────────────
    AAPL | 20200205 | ###.## | #.## | #.##
    AMZN | 20200205 | ###.## | #.## | #.##
    ...
    AAPL | 20200305 | ###.## | #.## | #.##
    AMZN | 20200305 | ###.## | #.## | #.##
    ...
    ──────────────────────────────────────────


Code As of Now

'prevdt' corresponds to 'Date Group' above, 'compcd' is the company code, 'rnk' is the row's rank within its (compcd, prevdt) group (used as x in the regression), and 'adjenp' is the price (used as y).

from sklearn.linear_model import LinearRegression
from scipy import stats

# Method tried 1: sklearn
model = LinearRegression()
def getRegrS(arg_cd, arg_prevdt):
    # one (company, date group) slice; x is the in-group rank, y is the price
    mask = (dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)
    x = dfStock.loc[mask, 'rnk'].to_numpy().reshape((-1, 1))
    y = dfStock.loc[mask, 'adjenp'].to_numpy()
    model.fit(x, y)
    return model.coef_[0], model.score(x, y)    # slope, R-square

# Method tried 2: scipy
def getRegrS(arg_cd, arg_prevdt):
    mask = (dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)
    x = dfStock.loc[mask, 'rnk'].to_numpy()
    y = dfStock.loc[mask, 'adjenp'].to_numpy()
    rv = stats.linregress(x, y)
    return rv[0], rv[2] ** 2                    # slope; rv[2] is r, so square it for R-square

# rank each row within its (compcd, prevdt) group; this is the x used above
dfStock['rnk'] = dfStock.groupby(['compcd', 'prevdt']).cumcount() + 1
dfSumS[['slope', 'rsq']] = [getRegrS(cd, prevdt)
                            for cd, prevdt in zip(dfSumS['compcd'], dfSumS['prevdt'])]

What I've tried before

Based on the recommendation in this link, I tried vectorization, but got the message "Can only compare identically-labeled Series objects" (pandas raises this when two Series with different indexes are compared element-wise). Unable to solve this problem, I fell back to the two functions above, which were not fast enough. Both worked on a smaller data set, such as the year 2020 alone, but once the data period grew to 2~3 decades, it took hours.

I thought of apply, iterrows, etc., but didn't use them: firstly, the link says they are slower than what I've done, and secondly, each of them seems to produce only one column, while I need two results - the coefficient and the R-square - over the same period, so calling them twice would definitely be slower.
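(For reference, a groupby().apply whose function returns a Series can fill several result columns in one pass; whether it is fast enough here is a separate question. A minimal sketch, reusing the column names from the code above:)

from scipy import stats
import pandas as pd

def regrS(group):
    # one regression per (compcd, prevdt) group: rnk is x, adjenp is y
    rv = stats.linregress(group['rnk'], group['adjenp'])
    return pd.Series({'avgprc': group['adjenp'].mean(),   # average price
                      'slope': rv.slope,
                      'rsq': rv.rvalue ** 2})             # R-square = r**2

dfSumS = dfStock.groupby(['compcd', 'prevdt']).apply(regrS).reset_index()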

Now I'm looking into the multiprocessing Pool approach mentioned in a few posts.
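(A rough sketch of what that could look like, assuming "the pool thing" means multiprocessing.Pool and that dfStock and getRegrS are defined at module level so worker processes can see them, e.g. with a fork-based start method:)

from multiprocessing import Pool

def worker(key):
    # key is a (compcd, prevdt) pair; reuses getRegrS from above
    cd, prevdt = key
    return getRegrS(cd, prevdt)

if __name__ == '__main__':
    keys = list(zip(dfSumS['compcd'], dfSumS['prevdt']))
    with Pool() as pool:
        results = pool.map(worker, keys)
    dfSumS[['slope', 'rsq']] = results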


1 Answer


I'm afraid that if you're trying to run thousands of large linear regressions, you will have to pay the price in running time. If you are only interested in the beta coefficient or the R-square value, it could be more efficient to calculate them directly with numpy: the coefficients as (XᵀX)⁻¹Xᵀy, and R-square as the square of cov(X,y)/sqrt(var(X)var(y)).
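A minimal numpy sketch of those two formulas for one group's arrays (the use of the in-group rank as x, and the idea of calling this per group, are assumptions carried over from the question's code):

import numpy as np

def slope_and_r2(x, y):
    # OLS with intercept: beta = (X'X)^(-1) X'y, slope is beta[1]
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    # R-square of a simple regression = squared Pearson correlation;
    # np.cov and np.var here both use the N-1 convention so they stay consistent
    r = np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
    return beta[1], r ** 2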

  • If I'm only using either the coefficient or R-square, which is faster: using a library, or writing a function as you mentioned? I searched and found some answers on Stack Overflow suggesting libraries like statsmodels.api or sklearn.linear_model. – Rich KS Oct 05 '20 at 03:53
  • It's possible that writing your own function could be faster since there is less overhead, but the greatest speed boost will come from calculating the R-square value without fitting the model, since you only need variances and covariances. That should be very quick. – Cameron Chandler Oct 05 '20 at 06:01
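Following up on that last comment, a rough, fully vectorized sketch (not the answerer's code): compute all per-group sums with a single groupby aggregation, then apply the covariance formulas, so there is no Python-level loop over groups. Column names reuse those from the question, and it assumes dfStock already carries the rnk column built above.

# per-row products first, then one aggregation pass per group
tmp = dfStock.assign(xy=dfStock['rnk'] * dfStock['adjenp'],
                     xx=dfStock['rnk'] ** 2,
                     yy=dfStock['adjenp'] ** 2)
g = tmp.groupby(['compcd', 'prevdt'])
s = g.agg(n=('rnk', 'size'), sx=('rnk', 'sum'), sy=('adjenp', 'sum'),
          sxy=('xy', 'sum'), sxx=('xx', 'sum'), syy=('yy', 'sum'))

# population (1/n) variances and covariance; the n's cancel in the ratios below
mean_x, mean_y = s['sx'] / s['n'], s['sy'] / s['n']
cov_xy = s['sxy'] / s['n'] - mean_x * mean_y
var_x = s['sxx'] / s['n'] - mean_x ** 2
var_y = s['syy'] / s['n'] - mean_y ** 2

s['avgprc'] = mean_y                      # average price per group
s['slope'] = cov_xy / var_x               # OLS slope
s['rsq'] = cov_xy ** 2 / (var_x * var_y)  # R-square = squared correlation
dfSumS = s[['avgprc', 'slope', 'rsq']].reset_index()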