As to your three questions:
- Your code is correct in the sense that it produces the correct result. As a rule, however, explicitly iterating over the rows of a dataframe is a bad idea in terms of performance. Most often the same result can be achieved far more efficiently with pandas methods (as you demonstrated yourself).
- Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
- Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, it often makes sense to retrieve a numpy representation of the dataframe or series (using the values attribute or the to_numpy() method), do all the calculations on the numpy array and only then store the result back to the dataframe or series.
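As a rough sketch of that last point: the data and the percentage-change calculation below are made up for illustration and are not taken from your code.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; 'Close' mirrors the column name used further down.
df = pd.DataFrame({'Close': [10.0, 12.0, 11.0, 15.0]})

# 1. Pull out a plain numpy array.
close = df['Close'].to_numpy()

# 2. Do the calculation on the array. Here: percentage change between
#    consecutive rows, with NaN for the first row (it has no predecessor).
pct = np.empty_like(close)
pct[0] = np.nan
pct[1:] = (close[1:] - close[:-1]) / close[:-1] * 100

# 3. Only now store the result back in the dataframe.
df['pct_change'] = pct
```

The dataframe is touched exactly twice, once to read and once to write; everything in between runs on the array.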
Why is the loop algorithm so slow?
In your loop algorithm, mean is calculated over 16500 times, each time adding up 14 elements to find the mean. Pandas' rolling method uses a more sophisticated approach, heavily reducing the number of arithmetic operations.
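To give a feel for what "more sophisticated" can mean here, the sketch below keeps a running sum and, for each new row, adds the value entering the window and subtracts the one leaving it. That is two additions per row instead of 14. It only illustrates the general idea and is not pandas' actual code; rolling_mean_sketch is a made-up name.

```python
import math

def rolling_mean_sketch(values, n):
    """Simple moving average using a running window sum (illustrative only)."""
    out = [math.nan] * len(values)
    window_sum = 0.0
    for i, v in enumerate(values):
        window_sum += v                  # value entering the window
        if i >= n:
            window_sum -= values[i - n]  # value leaving the window
        if i >= n - 1:
            out[i] = window_sum / n      # window is full from index n-1 on
    return out

print(rolling_mean_sketch([1.0, 2.0, 3.0, 4.0, 5.0], 3))
# [nan, nan, 2.0, 3.0, 4.0]
```

This is still a Python loop, so it will not be fast in itself; the point is only that the work per row no longer depends on the window size.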
You can achieve performance similar to pandas (in fact about 3 times better in this run) if you do the calculations in numpy. This is illustrated in the following example:
import pandas as pd
import numpy as np
import time
data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))
start = time.time()
df_1h['SMA_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    # .loc avoids chained assignment, which does not write to df_1h under copy-on-write
    df_1h.loc[i, 'SMA_14_loop'] = close[i-n+1:i+1].mean()
print('loop: {}'.format(time.time() - start))
def np_sma(a, n=14):
    ret = np.cumsum(a)  # running total of all values so far
    ret[n:] = ret[n:] - ret[:-n]  # difference of two totals = sum of the last n values
    return np.append([np.nan]*(n-1), ret[n-1:] / n)
start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np: {}'.format(time.time() - start))
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)
Output:
df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop: 7.605962753295898
np: 0.0010571479797363281