What I'm going to do
I'd like to get average stock price, regression coefficient and R-square of stock prices in float by stock item, e.g. Apple, Amazon, etc., and certain date period, e.g. Feb. 15 ~ Mar.14. as a part of quantitative investment simulation encompassing 30 years. The problem is that it simply is too slow. At first, I made the whole code with PostgreSQL but it was too slow - didn't finish after 2 hours. After asking a professor friend in management information system, I'm trying pandas for the first time.
The data structure implemented so far look like this:
Raw data (Dataframe named dfStock)
──────────────────────────────────────────
Code | Date | Date Group | Price |
──────────────────────────────────────────
AAPL | 20200205 | 20200205 | ###.## |
AAPL | 20200206 | 20200305 | ###.## |
...
AAPL | 20200305 | 20200305 | ###.## |
AAPL | 20200306 | 20200405 | ###.## |
...
──────────────────────────────────────────Results (Dataframe named dfSumS)
──────────────────────────────────────────
Code | Date group | Avg. Price | Slope | R-Square
──────────────────────────────────────────
AAPL | 20200205 | ###.## | #.## | #.##
AMZN | 20200205 | ###.## | #.## | #.##
...
AAPL | 20200305 | ###.## | #.## | #.##
AMZN | 20200305 | ###.## | #.## | #.##
...
──────────────────────────────────────────
Code As of Now
'prevdt' corresponds to 'Date Group' in the above and 'compcd' means company code
from sklearn.linear_model import LinearRegression
# Method Tried 1
model = LinearRegression()
def getRegrS(arg_cd, arg_prevdt):
x = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['rnk'].to_numpy().reshape((-1,1))
y = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['adjenp'].to_numpy()
model.fit(x, y)
return model.coef_[0], model.score(x,y)
# Method Tried 2
def getRegrS(arg_cd, arg_prevdt):
x = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['rnk'].to_numpy()
y = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['adjenp'].to_numpy()
rv = stats.linregress(x,y)
return rv[0], rv[2]
dfSumS['rnk'] = dfStock.groupby(['compcd','prevdt']).cumcount()+1
dfSumS[['slope','rsq']]= [getRegrS(cd, prevdt) for cd, prevdt in zip(dfSumS['compcd'], dfSumS['prevdt'])]
What I've tried before
Based on recommendation in this link, I tried vectoriztion, but got the message "Can only compare identically-labeled Series objects". Unable to solve this problem, I came to two functions in the above, which were not fast enough. Both worked with a smaller set of code like the year of 2020, but once the data period became as large as 2~3 decades, it took hours.
I thought of apply, iterrows, etc., but didn't because firstly the link says it's slower than I've done and secondly each of these seem to apply to only one column while I have to two results - coefficient and R-square over the same period so that calling them twice definitely will be slower.
Now I'm trying the pool thing mentioned in a few posts