Efficient Python Pandas Stock Beta Calculation on Many Dataframes

Question

I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to python and want to calculate a rolling 12month beta for each stock, I found a post to calculate rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion) however when used in my code below takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes this is too slow.

How can I improve the performance of my below code to match that of SQL? I understand Pandas/python has that capability. My current method loops over each row which I know slows performance but I am unaware of any aggregate way to perform a rolling window beta calculation on a dataframe.

Note: the first 2 steps of loading the CSVs into individual dataframes and calculating daily returns only takes ~20seconds. All my CSV dataframes are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.

Your help would be much appreciated! Thank you :)

import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('precision',10)  #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()

MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS 
# ***********************************************************************************************
for File in FilesLoaded:
    FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta

#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)
    for i in range(1, len(df)+1):
        sub_df = df.iloc[max(i-period, 0):i,:]
        if len(sub_df) >= min_periods:  
            idx = sub_df.index[-1]
            result[idx] = func(sub_df)
    return result

#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)    
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan

for File in FilesLoaded:
    df_join['stock'].update(FilesLoaded[File]['Return'])
    df_join  = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
    df_join  = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
    df_join  = df_join.fillna(0) #get rid of the NaNs in the return data
    FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)

# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))

score 19 · Accepted Answer · edited May 23 '17 at 12:17

19

Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks

dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)

df.iloc[:5, :5]

Roll Function
Returns groupby object ready to apply custom functions
See Source

def roll(df, w):
    # stack df.values w-times shifted once at each stack
    roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
    # roll_array is now a 3-D array and can be read into
    # a pandas panel object
    panel = pd.Panel(roll_array, 
                     items=df.index[w-1:],
                     major_axis=df.columns,
                     minor_axis=pd.Index(range(w), name='roll'))
    # convert to dataframe and pivot + groupby
    # is now ready for any action normally performed
    # on a groupby object
    return panel.to_frame().unstack().T.groupby(level=0)

Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source

def beta(df):
    # first column is the market
    X = df.values[:, [0]]
    # prepend a column of ones for the intercept
    X = np.concatenate([np.ones_like(X), X], axis=1)
    # matrix algebra
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
    return pd.Series(b[1], df.columns[1:], name='Beta')

Demonstration

rdf = roll(df, 12)
betas = rdf.apply(beta)

Timing

Validation
Compare calculations with OP

def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta

print(calc_beta(df.iloc[:12, :2]))

-0.311757542437

print(beta(df.iloc[:12, :2]))

s0001   -0.311758
Name: Beta, dtype: float64

Note the first cell
Is the same value as validated calculations above

betas = rdf.apply(beta)
betas.iloc[:5, :5]

Response to comment
Full working example with simulated multiple dataframes

num_sec_dfs = 4000

cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}

market = pd.Series(np.random.rand(480), dates, name='Market')

df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)

betas = roll(df.pct_change().dropna(), 12).apply(beta)

for c, col in betas.iteritems():
    dfs[c]['Beta'] = col

dfs['s0001'].head(20)

edited May 23 '17 at 12:17

Community

1
1

answered Sep 15 '16 at 04:59

piRSquared

285,575
57
475
624

Thank you for the detailed response! However 2.3seconds per dataframe is about the speed I am getting now (x4000 = a long time). Is there a faster way? As mentioned I can do all 4000+ in under 3 minutes with SQL tables. – cwse Sep 15 '16 at 06:13
1

@cwse 2.3 seconds is for 240 months and 4,000 stocks. SQL can do it in under 3 minutes. pandas + numpy can do it in under 3 seconds – piRSquared Sep 15 '16 at 06:13
1

@cwse I just ran with 40,000 simulated stock returns over 240 months, running rolling 12 month betas in under 40 seconds. – piRSquared Sep 15 '16 at 06:22
1

WOW!! That's freaking epic! Thanks mate :) – cwse Sep 15 '16 at 06:25
@cwse glad I could help. – piRSquared Sep 15 '16 at 06:52
Pir, I just realised this is using one big dataframe... my data is in 4000 dataframes, and I want to add a new column (Beta) to each dataframe. – cwse Sep 15 '16 at 08:25
@cwse we can work with that. 4,000 dataframes with columns `[Date, Open, High, Low, Close]`? I'll update my post on what to do. I'm assuming beta is being calculated on `Close`, correct? – piRSquared Sep 15 '16 at 08:27
@cwse post has been updated. Keep in mind that the multiple dataframes are in a dictionary. – piRSquared Sep 15 '16 at 08:44
Dude you are a genious..! matrix algebra and everything.. havent done that for years..! – cwse Sep 15 '16 at 11:30
Tell me, why do you transpose and pivot the 3D array in the Roll function? – cwse Sep 15 '16 at 11:31
@cwse it starts with my choice to line the `items` axis of the `panel` with the different roll periods (12 of them), stocks with the `major` axis, and dates with the `minor` axis. After that, `pd.Panel.to_frame()` drops the `items` axis into column headers and stacks the prior columns for each item's dataframe into rows. Knowing this, and that I ulimately want the dataframe's index to have the first level be dates, second level be roll period and columns be stocks, I needed to `unstack` and `T` – piRSquared Sep 15 '16 at 15:27
this method doesnt appear to work, I get a "Merror Error" in the roll function "roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T " line, as follows: " File "C:\Users\cwse8\Anaconda3\lib\site-packages\numpy\lib\shape_base.py", line 368, in dstack return _nx.concatenate([atleast_3d(_m) for _m in tup], 2) MemoryError" – cwse Sep 18 '16 at 09:03
@cwse I can assure you it does work. I've demonstrated that it does. You may have a specific scenario in which it's failing and we can get to the bottom of that. You say memory error, how much memory do you have available and over how many securities and over what time frame are you trying to execute this? You can also, run it over chunks of your data. Do it over one tenth of your data at a time. Admittedly, this method isn't as memory efficient as it could be seeing as I take the entire time indexed dataframe and multiply it by 12. – piRSquared Sep 18 '16 at 09:07
It runs with less files, but the output isnt as expected. is it possible to message you directly? – cwse Sep 18 '16 at 10:10
how can I execute it on chunks of data? my laptop has 8gb memory , iam running python 64bit. Another issue, the beta function assumes the market data is in the first column, however the pd.concat puts it in the last column, so it isn't actually calculating on the market data.. – cwse Sep 18 '16 at 11:42
I fixed the Market data column issue by putting the following code in the beta function: i = df.columns.get_loc(MarketIndex) X = df.values[:, [i]] – cwse Sep 18 '16 at 11:49
I tried running on only 2500 files and I even got a windows popup that my computer is low on memory haha.. the program then failed – cwse Sep 18 '16 at 11:58
Hi piRsquared, your partition of dataframe logic will index the partition using the first entry of the partition. In a time series context (dates aligned from old to new). in the finance context you normally want a window of historical data that is backward looking wouldn't you? – casbby Nov 29 '16 at 22:48
@casbby you opened up many points for conversation. I'll try to be succinct. The beta calculation is made with contemporaneous series so the only potential issue with alignment is over the course of a single period. Meaning, "is this the beta for the end of this month or end of last month?" However, that depends on the nature of the returns. I generated random returns and did not specify if they were forward looking or backward looking. I'd leave that up the user to verify as it's very important. – piRSquared Nov 29 '16 at 22:58
@casbby I'm almost always looking at building strategies and I need to see how this portfolio today will perform over the subsequent period. That implies that I want to look forward. That gets me into trouble though because many people want to see how their portfolios did over the prior period. It takes much explanation but I still find it worth it to keep looking forward. – piRSquared Nov 29 '16 at 23:00
thanks for the quick reply. it is a piece of awesome algo. have you benchmarked it against the usual partition algo of using ix? i am sure your way is faster. just interested in finding out how much faster – casbby Nov 29 '16 at 23:01
@casbby I did, don't know where I've put it atm >. – piRSquared Nov 29 '16 at 23:03
your algo is a example of how functional form can be applied to make things better! – casbby Nov 29 '16 at 23:04
I appreciate you saying so. – piRSquared Nov 29 '16 at 23:06
I get the "TypeError: object() takes no parameters" when I try to run you code for line 9 in the roll function >>>minor_axis=pd.Index(range(w), name='roll')) Did something depracate? – Luca R Aug 14 '20 at 04:00
2

This looks so perfect for what I need to do, I was wondering if you were able to post an updated solution using DataFrames multi-level indexing now that Panel has been depreciated. Thanks :) – bgaerber Mar 19 '21 at 19:40

piRSquared · Answer 2 · 2016-09-20T06:02:36.217

9

Using a generator to improve memory efficiency

Simulated data

m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)

Beta Calculation

def beta(df, market=None):
    # If the market values are not passed,
    # I'll assume they are located in a column
    # named 'Market'.  If not, this will fail.
    if market is None:
        market = df['Market']
        df = df.drop('Market', axis=1)
    X = market.values.reshape(-1, 1)
    X = np.concatenate([np.ones_like(X), X], axis=1)
    b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
    return pd.Series(b[1], df.columns, name=df.index[-1])

roll function
This returns a generator and will be far more memory efficient

def roll(df, w):
    for i in range(df.shape[0] - w + 1):
        yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)

Putting it all together

betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T

Validation

OP beta calc

def calc_beta(df):
    np_array = df.values
    m = np_array[:,0] # market returns are column zero from numpy array
    s = np_array[:,1] # stock returns are column one from numpy array
    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta

Experiment setup

m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')

cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}

market = pd.Series(np.random.rand(m), dates, name='Market')

df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)

betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T

for c, col in betas.iteritems():
    dfs[c]['Beta'] = col

dfs['s0000'].head(20)

calc_beta(df[['Market', 's0000']])

0.0020118230147777435

NOTE:
The calculations are the same

edited Sep 20 '16 at 06:02

answered Sep 19 '16 at 05:26

piRSquared

285,575
57
475
624

thank you.. this seems to be getting there however the answer is incomplete? It all runs but I am not getting beta values that's for sure...I am getting decimals but they are not correct. No more 3D array and Panels in the roll function? – cwse Sep 19 '16 at 13:02
The memory efficiency issue got me thinking. This beta calc is the same but the roll is quite different. The panel from the other answer was a slick way to put everything together into a generic structure that could be calculated upon again and again. Using the generator only uses the slice of the dataframe it needs at the time of the beta calculation, then moves on. Anyway, I'll dbl chk the calcs. – piRSquared Sep 19 '16 at 13:15
I am recently having issues with the line: "df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(1)" in Python 3.4... it appears the iteration over dict keys is not working with the rename function...i get this error: TypeError: 'str' object is not callable (Python) I tried googling solutions and trying different iteration methods..list(), iter(), items().. over the dict and cant get it to work!! I hope you can help! Thank you!! – cwse Jan 10 '17 at 11:36
@cwse "df = pd.concat([market] + [dfs[k].Close.rename(k) for k, v in dfs.items()], axis=1).sort_index(1)" – piRSquared Jan 10 '17 at 16:25
@cwse "df = pd.concat([market] + [v.Close.rename(k) for k, v in dfs.items()], axis=1).sort_index(1)" – piRSquared Jan 10 '17 at 16:25
thanks for the quick response! THat is neater, but still getting the same error on that line: '...packages\pandas\core\internals.py", line 4391, in items = [func(x) for x in index] TypeError: 'str' object is not callable' any ideas? Thanks buddy! – cwse Jan 11 '17 at 00:45
I had to downgrade to Python 3.4.3 to use Zipline... could this be the reason?? So strange. But I need Zipline..! – cwse Jan 11 '17 at 00:52
I just ran it with 3.5 – piRSquared Jan 11 '17 at 00:56
could it be because the dataframe names are strings? i.e. to call a dataframe for a particular stock called CSL I do: print(dfs['CSL'] ) ...however this was working fine on newer versions of Python.. – cwse Jan 11 '17 at 01:06

mcguip · Answer 3 · 2019-02-16T17:55:42.770

While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.

The following optimizes only the subdivision of the data set into rolling windows:

def numpy_betas(x_name, window, returns_data, intercept=True):
    if intercept:
        ones = numpy.ones(window)

    def lstsq_beta(window_data):
        x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
        beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data)
        return beta_arr[0]

    indices = [int(x) for x in numpy.arange(0, returns_data.shape[0] - window + 1, 1)]
    return DataFrame(
        data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices]
        , columns=list(returns_data.columns)
        , index=returns_data.index[window - 1::1]
    )

The following also optimizes the beta calculation itself:

def custom_betas(x_name, window, returns_data):
    window_inv = 1.0 / window
    x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
    y_sum = returns_data.rolling(window, min_periods=window).sum()
    xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
    xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
    xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
    x_var = xx_sum - window_inv * numpy.square(x_sum)
    betas = xy_cov.divide(x_var, axis=0)[window - 1:]
    betas.columns.name = None
    return betas

Comparing the performance of the two different calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first:

Comparing the performance to that of @piRSquared's implementation, the custom method takes roughly 350 millis to evaluate compared to over 2 seconds.

score 2 · Answer 4 · answered Apr 12 '19 at 04:51

Further optimizing on @piRSquared's implementation for both speed and memory. the code is also simplified for clarity.

from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range

def calc_beta(s: ndarray, m: ndarray):
  x = vstack((ones_like(m), m))
  b = pinv(x.dot(x.T)).dot(x).dot(s)
  return b[1]

def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
  result = ndarray(shape=s_df.shape, dtype=float)
  l, w = s_df.shape
  ls, ws = s_df.values.strides
  result[0:period - 1, :] = nan
  s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
  m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ls, ls))
  for row in range(period, l):
    result[row, :] = calc_beta(s_arr[row - period, :], m_arr[row - period])
  return DataFrame(data=result, index=s_df.index, columns=s_df.columns)

if __name__ == '__main__':
  num_sec_dfs, num_periods = 4000, 480

  dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
  stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
                   columns=['s{:04d}'.format(i) for i in 
                            range(num_sec_dfs)]).pct_change()
  market = DataFrame(data=random.rand(num_periods), index=dates, columns= 
              ['Market']).pct_change()
  betas = rolling_calc_beta(stocks, market, 12)

%timeit betas = rolling_calc_beta(stocks, market, 12)

335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

score 2 · Answer 5 · answered Dec 02 '20 at 15:55

HERE'S THE SIMPLEST AND FASTEST SOLUTION

The accepted answer was too slow for what I needed and the I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.

I don't think you need to make a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least .19). The below code assumes the data is in the same format as the above problems--a pandas dataframe with a date index, percent returns of some periodicity for the stocks, and market values are located in a column named 'Market'.

If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:

# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data), how = 'inner').pct_change().dropna()

After that, it's just covariance divided by variance.


ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance

NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Last, if you want to calculate beta values for the periods before the full window (ex. stock_data begins 1 year ago, but you use 3yrs of data), then you can modify the above to and expanding (instead of rolling) window with the same calculation and then .combine_first() the two.

score 0 · Answer 6 · answered Feb 16 '21 at 04:44

Created a simple python package finance-calculator based on numpy and pandas to calculate financial ratios including beta. I am using the simple formula (as per investopedia):

beta = covariance(returns, benchmark returns) / variance(benchmark returns)

Covariance and variance are directly calculated in pandas which makes it fast. Using the api in the package is also simple:

import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)

which will give you a dataframe of date and beta or the last beta value if tail is true.

score -1 · Answer 7 · answered Oct 12 '18 at 12:36

but these would be blockish when you require beta calculations across the dates(m) for multiple stocks(n) resulting (m x n) number of calculations.

Some relief could be taken by running each date or stock on multiple cores, but then you will end up having huge hardware.

The major time requirement for the solutions available is finding the variance and co-variance and also NaN should be avoided in (Index and stock) data for a correct calculation as per pandas==0.23.0.

Thus running again would result stupid move unless the calculations are cached.

numpy variance and co-variance version also happens to miss-calculate the beta if NaN are not dropped.

A Cython implementation is must for huge set of data.

In my opinion, this should have been posted as comments to the question, because it isn't really a coherent answer, just a collection of remarks. If you don't have comment privilege yet, then I would advise you to go earn the few reputation points needed to acquire this privilege. — Jean-François Corbett, Oct 12 '18 at 13:00

Efficient Python Pandas Stock Beta Calculation on Many Dataframes

7 Answers7

Validation

Linked

Related