Using pandas 1.0 and NumPy 1.18, I need to apply a rolling aggregation multiple times, with varying window sizes and summary functions, to a large dataframe with a large number of groups. Before the summary function is applied, the series is shifted by 1 so that the current row's value is excluded. This is an example of a rolling max shifted by 1:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [5,2,4,5,4,2,3,5,5,2,4,1], 'b': [18,37,60,45,40,40,50,10,30,2,46,19]})
df = df.sort_values('a').reset_index(drop=True)
df['max'] = (df.groupby('a', sort=False, as_index=False)['b']
               .rolling(2, min_periods=1)
               .apply(lambda x: np.max(x[:-1]))
               .reset_index(drop=True))
Result:
df
a b max
0 1 19 NaN
1 2 37 NaN
2 2 40 37.0
3 2 2 40.0
4 3 50 NaN
5 4 60 NaN
6 4 40 60.0
7 4 46 40.0
8 5 18 NaN
9 5 45 18.0
10 5 10 45.0
11 5 30 10.0
The result is correct, but it becomes too slow once applied to a large dataframe. Is there a way to refactor this logic to use vectorized operations instead of apply, which, as I understand, is implemented as a Python-level loop under the hood and performs poorly?