Building on this question, which provided a clever solution for applying a function over multiple columns in a DataFrame, I'm wondering whether that solution can be optimized further for speed.
Environment: Python 2.7.8, pandas 0.14.1, NumPy 1.8.
Here's the example setup:
import pandas as pd
import numpy as np
import random
def meanmax(ii,df):
    xdf = df.iloc[map(int,ii)]
    n = max(xdf['A']) + max(xdf['B'])
    return n / 2.0
df = pd.DataFrame(np.random.randn(2500,2)/10000,
                  index=pd.date_range('2001-01-01',periods=2500),
                  columns=['A','B'])
df['ii'] = range(len(df))
res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
Note that the meanmax function is not pairwise, so something like rolling_mean(df['A'] + df['B'],26) won't work.
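To make the point concrete, here is a small sketch for a single window. The values are made up for illustration and are not taken from the DataFrame above. It shows that reducing each column separately gives a different result from reducing the summed column.

```python
import numpy as np

# Toy window values (illustrative only, not from the original data).
a = np.array([1.0, 5.0, 2.0])
b = np.array([4.0, 0.0, 3.0])

# What meanmax computes for one window: each column's max is taken separately.
separate = (a.max() + b.max()) / 2.0

# What a single-column rolling function over A + B would see: the max of the row-wise sum.
combined = (a + b).max() / 2.0

print(separate, combined)
```

The two numbers differ because the maxima of A and B fall on different rows, which is exactly the information lost once the columns are added together.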
However, I can do something like:
res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
This completes roughly 3000x faster:
%timeit res = pd.rolling_apply(df.ii, 26, lambda x: meanmax(x, df))
1 loops, best of 3: 1 s per loop
%timeit res2 = (pd.rolling_max(df['A'],26) + pd.rolling_max(df['B'],26)) / 2
1000 loops, best of 3: 325 µs per loop
Is there anything better than, or equivalent to, the second option above that still uses rolling_apply for the example function? While the second option is faster, it doesn't use rolling_apply, which can be applied to a wider set of problems.
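One possible direction, offered as a rough sketch rather than a definitive answer, is to build the windows once with NumPy strides and then use vectorized reductions along the window axis. This assumes the function can be written in terms of per-window NumPy reductions. The helper names rolling_windows and rolling_meanmax are hypothetical and not part of pandas or NumPy.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_windows(values, window):
    """Hypothetical helper: 2-D view whose rows are consecutive windows of `values`."""
    values = np.ascontiguousarray(values, dtype=float)
    n = values.shape[0]
    if window > n:
        raise ValueError("window is longer than the data")
    stride = values.strides[0]
    # The view shares memory with `values`; treat it as read-only.
    return as_strided(values, shape=(n - window + 1, window),
                      strides=(stride, stride))

def rolling_meanmax(a, b, window):
    """Hypothetical helper: mean of the per-window maxima of two columns."""
    out = np.full(len(a), np.nan)  # the first window-1 entries have no full window
    wa = rolling_windows(a, window)
    wb = rolling_windows(b, window)
    out[window - 1:] = (wa.max(axis=1) + wb.max(axis=1)) / 2.0
    return out

# Synthetic data shaped like the example above (seeded for repeatability).
rng = np.random.RandomState(0)
a = rng.randn(2500) / 10000
b = rng.randn(2500) / 10000
res3 = rolling_meanmax(a, b, 26)
```

With the DataFrame from the question, the call would be rolling_meanmax(df['A'].values, df['B'].values, 26). Any other function that can be expressed as reductions over axis=1 of the window views could be substituted for the max calls. I have not timed this against the two options above.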
Edit: Performance timing correction