1

Using Pandas 1.0 and Numpy 1.18, I need to apply Rolling multiple times, with a varying window size and summary functions to a large dataframe with a large number of groups. Before applying the summary function the Series is also shifted by 1 to discard the current row value. This is an example for a rolling max shifted by 1:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [5,2,4,5,4,2,3,5,5,2,4,1], 'b': [18,37,60,45,40,40,50,10,30,2,46,19]})

df = df.sort_values('a').reset_index(drop=True)

df['max'] = df.groupby('a', sort=False, as_index=False)['b'].rolling(2, min_periods=1).apply(lambda x: np.max(x[:-1])).reset_index(drop=True)

Result:

df
    a   b   max
0   1  19   NaN
1   2  37   NaN
2   2  40  37.0
3   2   2  40.0
4   3  50   NaN
5   4  60   NaN
6   4  40  60.0
7   4  46  40.0
8   5  18   NaN
9   5  45  18.0
10  5  10  45.0
11  5  30  10.0

The result is correct, but it takes too long once it is applied to a large dataframe and I was wondering if there is a way to refactor this logic to make use of vectorization instead of relying on apply, which, as I read, is implemented as a loop under the hood and it performs poorly.

Javide
  • 2,477
  • 5
  • 45
  • 61
  • This looks like a decent solution. If you say "too long", what do you mean exactly – Erfan Jun 17 '20 at 10:41
  • About a minute. I have a few of these rolling operations and 15 minute constraint on the overall execution of my script. I am not complaining about the performance of the Pandas groupby/apply functions, I am just trying to find an optimal solution and found on other operations that vectorization dramatically reduce execution time. I just don't know how Rolling and shifting operations like the one I illustrated can be refactored by vectorization and am seeking advice. – Javide Jun 17 '20 at 10:55
  • Does this answer your question? [Taking subarrays from numpy array with given stride/stepsize](https://stackoverflow.com/questions/40084931/taking-subarrays-from-numpy-array-with-given-stride-stepsize) – Joe Jun 17 '20 at 11:59
  • https://stackoverflow.com/questions/4923617/efficient-numpy-2d-array-construction-from-1d-array – Joe Jun 17 '20 at 11:59
  • https://stackoverflow.com/questions/52149479/time-series-data-preprocessing-numpy-strides-trick-to-save-memory – Joe Jun 17 '20 at 12:00

0 Answers0