Very slow computation on a Dataframe based on previous rows

Question

I am calculating the RSI value for a stock price where a previous row is needed for the result of the current row. I am doing it currently via for looping the full dataframe for as many times there are entries which takes a lot of time (executing time on my pc around 15 seconds).

Is there any way to improve that code?

import pandas as pd
from pathlib import Path


filename = Path("Tesla.csv")
test = pd.read_csv(filename)
data = pd.DataFrame(test[["Date","Close"]])
data["Change"] = (data["Close"].shift(-1)-data["Close"]).shift(1)
data["Gain"] = 0.0
data["Loss"] = 0.0
data.loc[data["Change"] >= 0, "Gain"] = data["Change"]
data.loc[data["Change"] <= 0, "Loss"] = data["Change"]*-1
data.loc[:, "avgGain"] = 0.0
data.loc[:, "avgLoss"] = 0.0
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()


for index in data.iterrows():
    data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
    data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14

The used dataset can be downloaded here: TSLA historic dataset from yahoo finance

The goal is to calculate the RSI value based on the to be calculated avgGain and avgLoss value. The avgGain value on rows 0:14 are not existent. The avgGain value on row 15 is the mean value of row[1:14] of the Gain column. The avgGain value from row 16 onwards is calculated as: (13*avgGain(row before)+Gain(current row))/14

You could speed things up by using [`apply`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) rather than iterating over the rows. — RJ Adriaansen, Apr 04 '21 at 19:44
I'm not totally clear on what this loop is trying to achieve though. `index` is never actually used. It looks like two lines would achieve the same thing by themselves outside the loop. — tdy, Apr 04 '21 at 19:58
Could you provide a reproducible example with a small data set? — Orkun Berk Yuzbasioglu, Apr 04 '21 at 22:04
Could you either enter some data from the `tesla.csv` manually to the script or upload it somewhere on the internet such as google drive spreadsheets. It is still unreproducible. — Orkun Berk Yuzbasioglu, Apr 05 '21 at 14:55
Here the data in google drive spreadsheets: https://docs.google.com/spreadsheets/d/1m6r_9vKa9Srv9CMUobq2sVR6lB9CXkcPZ6zp2PgIFoY/edit?usp=sharing — soderdaen, Apr 08 '21 at 21:43

Orkun Berk Yuzbasioglu · Answer 1 · 2021-04-05T15:52:02.397

'itertuples' is faster than 'iterrows' and vectorized operations generally performs best in terms of time.

Here you can calculate average gains and losses (rolling averages) over 14 days with the rolling method with a window size of 14.

%%timeit
data["avgGain"].iat[14] = data["Gain"][1:15].mean()
data["avgLoss"].iat[14] = data["Loss"][1:15].mean()

for index in data.iterrows():
    data.loc[15:, "avgGain"] = (data.loc[14:, "avgGain"].shift(1)*13 + data.loc[15:, "Gain"])/14
    data.loc[15:, "avgLoss"] = (data.loc[14:, "avgLoss"].shift(1)*13 + data.loc[15:, "Loss"])/14

1.12 s ± 3.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
data['avgGain_alt'] = data['Gain'].rolling(window=14).mean().fillna(0)
data['avgLos_alt'] = data['Gain'].rolling(window=14).mean().fillna(0)

1.38 ms ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

data.head(15)

Using vectorized operations to calculate moving averages is approximately 10 times faster than calculating with loops.

However note that, there is also some calculation error in your code for the averages after the first one.

I have seen the rolling function (in combination with the mean functio) but it doesn't give the same value is wanted. Easily said the function I need is: newAvgGainValue = (lastAvgGainValue*13+currentGainValue)/14 which leads to an impact of even the very first value to the very last value. It doesn't make that much sense imho, but that's how its calculated. (ok there are also some other ways to calculate the rolling average for RSI, but the comparison hompage uses the calculation mentioned above) — soderdaen, Apr 08 '21 at 21:29

Very slow computation on a Dataframe based on previous rows

1 Answers1