
The need arose because I want to calculate something like:

df['result'] = np.where(df['data1'] == True, df["data2"].rolling(window).max(), 0)

However, window is really df['window']: a Series in the same time series that holds the per-row window lengths as integers, for example [2, 2, 2, 3, 3, 2, 2, 4, 2, 2].

What's the most vectorized (or otherwise most efficient) way to do this?
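
For clarity, a straightforward but presumably slow way to express what I'm after is sketched below (the naive_result helper and the clamping at the start of the frame are illustrative assumptions, not code I already have):

import numpy as np
import pandas as pd

def naive_result(df):
    # Non-vectorized reference: for each row, take the max of data2 over the
    # last df['window'] rows for that row (clamped at the start of the frame),
    # then zero the result wherever data1 is False.
    def row_max(i):
        w = int(df['window'].iat[i])
        start = max(0, i - w + 1)
        return df['data2'].iloc[start:i + 1].max()

    rolling = pd.Series([row_max(i) for i in range(len(df))], index=df.index)
    return np.where(df['data1'], rolling, 0)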

j riv
    Please present a fuller example with data we can play with and what your desired output should be given that example. – piRSquared Apr 08 '22 at 18:41
  • @piRSquared Yeah, I may do that later since I know it helps for developing answers. In the meantime, I think this question covers almost exactly the same topic: https://stackoverflow.com/questions/57064501. However, there is no vectorized answer there (though I don't know whether one exists). – j riv Apr 08 '22 at 18:52
  • Don't you worry, I've got a vectorized solution for you. – piRSquared Apr 08 '22 at 21:48
  • It appears to be extremely slow for large dataframes with every method I tried, possibly because SIMD hardware prefers constant-size windows, but I hope I'm wrong. – j riv Apr 09 '22 at 04:58
  • Since the problem shifted purely to performance, I made a new question specifically for that: https://stackoverflow.com/questions/71805439 – j riv Apr 09 '22 at 05:15

1 Answer


Setup

import pandas as pd
import numpy as np

np.random.seed([3, 14])
a = np.random.randn(20).cumsum()            # random walk to take the rolling max of
w = np.minimum(
    np.random.randint(1, 4, size=a.shape),  # per-row window lengths in {1, 2, 3}
    np.arange(len(a)) + 1                   # clamped so no window reaches before row 0
)

df = pd.DataFrame({'Data': a, 'Window': w})
df

        Data  Window
0  -0.602923       1
1  -1.005579       2
2  -0.703250       3
3  -1.227599       1
4  -0.683756       1
5  -0.670621       2
6  -0.997120       1
7   0.387956       3
8   0.255502       1
9  -0.152361       2
10  1.150534       3
11  0.546298       3
12  0.302936       3
13  0.091674       1
14 -1.964947       1
15 -1.447079       2
16 -1.487828       1
17 -2.539703       1
18 -1.932612       3
19 -4.163049       2

Vectorized variable-window rolling maximum

idx_base = np.arange(len(df))
windows = df.Window.to_numpy()
data = df.Data.to_numpy()

# For every position, list the indices its window covers
# (position - window + 1 up to and including position)
idx_max = np.concatenate([
    np.arange(position - (window_size - 1), position + 1)
    for window_size, position in zip(windows, idx_base)
])

# Repeat each position once per element of its window, so idx_pos[k]
# says which output row data[idx_max[k]] contributes to
idx_pos = np.repeat(idx_base, windows)
data_windowed = data[idx_max]

# Unbuffered reduction: out[i] becomes the max of all window values mapped to i.
# Initializing with the global minimum guarantees every update can only raise it.
out = np.full(data.shape, data.min())
np.maximum.at(out, idx_pos, data_windowed)

The answer is now in the out variable. Let's check it against a plain pandas groupby max over the same windows:

df.Data.iloc[idx_max].groupby(idx_pos).max().to_frame('Pandas').assign(Numpy=out)

      Pandas     Numpy
0  -0.602923 -0.602923
1  -0.602923 -0.602923
2  -0.602923 -0.602923
3  -1.227599 -1.227599
4  -0.683756 -0.683756
5  -0.670621 -0.670621
6  -0.997120 -0.997120
7   0.387956  0.387956
8   0.255502  0.255502
9   0.255502  0.255502
10  1.150534  1.150534
11  1.150534  1.150534
12  1.150534  1.150534
13  0.091674  0.091674
14 -1.964947 -1.964947
15 -1.447079 -1.447079
16 -1.487828 -1.487828
17 -2.539703 -2.539703
18 -1.487828 -1.487828
19 -1.932612 -1.932612

Appendix

Look at idx_max. You'll see how we end up vectorizing this.

print(idx_max)

[           #  Position   Window
  0         #         0        1
  0  1      #         1        2
  0  1  2   #         2        3
  3         #         3        1
  4         #         4        1
  4  5      #         5        2
  6         #         6        1
  5  6  7   #         7        3
  8         #         8        1
  8  9      #         9        2
  8  9 10   #        10        3
  9 10 11   #        11        3
 10 11 12   #        12        3
 13         #        13        1
 14         #        14        1
 14 15      #        15        2
 16         #        16        1
 17         #        17        1
 16 17 18   #        18        3
 18 19      #        19        2
]
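
If the Python-level list comprehension that builds idx_max ever becomes a bottleneck on large frames, the same array can, as far as I can tell, also be built entirely in NumPy with a standard offset trick. This is only a rough sketch reusing the windows and idx_base arrays defined above:

# Rough sketch: build idx_max without a Python-level loop.
csum = np.cumsum(windows)
local = np.arange(csum[-1]) - np.repeat(csum - windows, windows)  # 0..w-1 inside each window
starts = idx_base - (windows - 1)                                 # first index each window covers
idx_max_alt = np.repeat(starts, windows) + local
# idx_max_alt should match idx_max, e.g. np.array_equal(idx_max_alt, idx_max)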
piRSquared
  • Thanks, this appears to produce the same results as an apply() method. However, it's still extremely slow, to the point of being unusable compared to rolling().max() with a constant integer window. I suspect it may be impossible to make it fast, because SIMD hardware may prefer constant-size windows. – j riv Apr 09 '22 at 04:51
  • Since the question shifted purely to performance, I accepted this answer and made a new question specifically for performance: https://stackoverflow.com/questions/71805439 – j riv Apr 09 '22 at 05:16
  • Have you explored the use of Numba at all? – piRSquared Apr 09 '22 at 08:12
  • No, but I believe the first answer there does directly use Numba and compares it to Cython, though it raised an issue with the correctness of the results. – j riv Apr 09 '22 at 10:45
  • I noticed something odd with this method: while it's still noticeably slower than the method found at the new question (https://stackoverflow.com/a/71806274/277716), I get the vague impression it's now faster than when I first tested it. The only difference I can think of is having installed numba/llvmlite, but it stays the same after they are uninstalled, so that might be unrelated. It's not crucially important though, since the other method remains more performant (both do appear to produce the same results). – j riv Apr 09 '22 at 12:46
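
Regarding the Numba suggestion in the comments above: a rough sketch of that route for the variable-window backward maximum could look like the following (the function name and loop structure are illustrative only, not the code from the linked follow-up answer):

# Rough Numba sketch of a per-row, variable-window backward rolling max.
import numpy as np
from numba import njit

@njit
def variable_window_max(data, windows):
    n = data.shape[0]
    out = np.empty(n)
    for i in range(n):
        start = max(0, i - windows[i] + 1)   # clamp the window at row 0
        m = data[start]
        for j in range(start + 1, i + 1):    # linear scan of this row's window
            if data[j] > m:
                m = data[j]
        out[i] = m
    return out

# e.g. variable_window_max(df.Data.to_numpy(), df.Window.to_numpy())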