Why is Pandas so madly fast? How to define such functions?

Question

I tried comparing the performance of Pandas and the traditional loop. I realized that with the same input and output, Pandas performed terribly fast calculations compared to the traditional loop.

My code:

#df_1h has been imported before

import time

n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))

close = df_1h['close']

start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))

start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))

display(df_1h.tail())

Output:

"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
        open_time       open        high        low         ... ignore  rsi_14  sma_14_pandas   sma_14_loop
16593   1.562980e+12    11707.39    11739.90    11606.04    ... 0.0 51.813151   11646.625714    11646.625714
16594   1.562983e+12    11664.32    11712.61    11625.00    ... 0.0 49.952679   11646.834286    11646.834286
16595   1.562987e+12    11632.64    11686.47    11510.00    ... 0.0 47.583619   11643.321429    11643.321429
16596   1.562990e+12    11582.06    11624.04    11500.00    ... 0.0 48.725262   11644.912857    11644.912857
16597   1.562994e+12    11604.96    11660.00    11588.16    ... 0.0 50.797087   11656.723571    11656.723571
5 rows × 15 columns

Pandas almost faster than 2.5k times!!!

My Questions:

Is my code wrong?
If my code is correct, why is Pandas so fast?
How to define custom functions that run so fast for Pandas?

Pandas' rolling function specifically is *very fast*, way faster than looping as your timings show — yatu, Jul 13 '19 at 11:09
As a side note, you should avoid using chained indexing and use panda's `loc` and `iloc` indexers instead, find more [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) — yatu, Jul 13 '19 at 11:12
@yatu I fully support your recommendation, but I noticed that replacing `df_1h['SMA_14_loop'][i] =...` by `df_1h.loc[i,'SMA_14_loop'] = ...` increased the execution time of the loop algorithm by 30 % (from 7.6 to 10.1 s in the test described in my answer). Do you know of any similar comparisons or do you have an explanation for it? — Stef, Jul 13 '19 at 18:46

Stef · Accepted Answer · 2019-07-13T21:35:33.410

As to your three questions:

Your code is correct in the sense that it produces the correct result. Explicitely iterating over the rows of a dataframe is as a rule however not so good an idea in terms of performance. Most often the same result can be achieved far more efficiently by pandas methods (as you demonstrated yourself).
Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, if often makes sense to retrieve a numpy respresentation of the dataframe or series (using the value attribute or to_numpy() method), do all the calculations on the numpy array and only then store the result back to the dataframe or series.

Why is the loop algorithm so slow?

In your loop algorithm, mean is calculated over 16500 times, each time adding up 14 elements to find the mean. Pandas' rolling method uses a more sophisticated approach, heavily reducing the number of arithmetic operations.

You can achieve similar (and in fact about 3 times better) performance than pandas if you do the calculations in numpy. This is illustrated in the following example:

import pandas as pd
import numpy as np
import time

data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))

start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))

start = time.time()
df_1h['SMA_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
    df_1h['SMA_14_loop'][i] = close[i-n+1:i+1].mean()
print('loop:   {}'.format(time.time() - start))

def np_sma(a, n=14) :
    ret = np.cumsum(a)
    ret[n:] = ret[n:] - ret[:-n]
    return np.append([np.nan]*(n-1), ret[n-1:] / n)

start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np:     {}'.format(time.time() - start))

assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)

Output:

df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop:   7.605962753295898
np:     0.0010571479797363281

Thanks for awesome answer. Can you take a look at another one here, please. https://stackoverflow.com/q/57006437/6452246 — Thai D. V., Jul 14 '19 at 12:31

Why is Pandas so madly fast? How to define such functions?

My code:

Output:

My Questions:

1 Answers1

As to your three questions:

Why is the loop algorithm so slow?

Linked