Rolling idxmin/max for pandas DataFrame

Question

I believe the following function is a working solution for pandas DataFrame rolling argmin/max:

import numpy as np
import pandas as pd

def data_frame_rolling_arg_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws).apply(getattr(np, f'arg{func}'))[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))

It is inspired by a partial solution for rolling idxmax on pandas Series.

Explanations:

Apply the numpy argmin/max function to the rolling window.
Only keep the non-NaN values.
Convert the values to int.
Realign the values to original row numbers.
Use applymap to replace the row numbers by the index values.
Combine with the original DataFrame filled with NaN in order to add the first rows with expected NaN values.
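The steps above can be unrolled on a single small column to make each intermediate result visible. This is only a sketch under a few assumptions: the toy data and variable names are made up for illustration, `raw=True` is passed to `apply` (the function above does not do this), and the label lookup and the final padding use plain NumPy indexing and `reindex` instead of `applymap` and `combine_first`.

```python
import numpy as np
import pandas as pd

# Toy data, chosen only for illustration.
df = pd.DataFrame({"x": [3.0, 1.0, 2.0, 5.0, 4.0]}, index=list("abcde"))
ws = 3

# Step 1: position of the max *within* each window (NaN for incomplete windows).
pos_in_window = df.rolling(ws).apply(np.argmax, raw=True)

# Steps 2-3: keep only the complete windows and convert the positions to int.
pos_in_window = pos_in_window[ws - 1:].astype(int)

# Step 4: add each window's first row number to get absolute row numbers.
window_start = np.arange(len(df) - (ws - 1)).reshape(-1, 1)
row_numbers = pos_in_window + window_start

# Step 5: translate row numbers into index labels.
labels = pd.DataFrame(
    df.index.to_numpy()[row_numbers.to_numpy()],
    index=row_numbers.index,
    columns=row_numbers.columns,
)

# Step 6: restore the leading rows, which become NaN.
result = labels.reindex(df.index)
print(result)
```

For this input, the three complete windows are `[3, 1, 2]`, `[1, 2, 5]` and `[2, 5, 4]`, so rows `c`, `d` and `e` get the labels `a`, `d` and `d`.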


In [1]: index = map(chr, range(ord('a'), ord('a') + 10))

In [2]: df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)

In [3]: df
Out[3]: 
    0   1   2
a  -4  15   0
b   0  -6   4
c   7   8 -18
d  11  12 -16
e   6   3  -6
f  -1   4  -9
g   6 -10  -7
h   8  11 -25
i  -2 -10  -8
j   0  10  -7

In [4]: data_frame_rolling_arg_func(df, 3, 'max')
Out[4]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    a    b
d    d    d    b
e    d    d    e
f    d    d    e
g    e    f    e
h    h    h    g
i    h    h    g
j    h    h    j

In [5]: data_frame_rolling_arg_func(df, 3, 'min')
Out[5]: 
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    b    c
d    b    b    c
e    e    e    c
f    f    e    d
g    f    g    f
h    f    g    h
i    i    g    h
j    i    i    h

My questions are:

Can you find any mistakes?
Is there a better solution? That is: more performant and/or more elegant.

And for pandas maintainers out there: it would be nice if the already great pandas library included rolling idxmax and idxmin.

A little bit of self-critique after testing on real data: the OP's implementation won't work with only `NaN` values in a column for a window, and it seems to take orders of magnitude longer than a rolling max. Unless the performance part can be solved I will need to limit the function to a few rows (which is what I need), and I will have to solve the `NaN` issue. — nilo, Jan 01 '21 at 01:24

nilo · Answer 1 · 2021-01-01T21:43:04.963

The NaN issue I mentioned in a comment to the OP can be solved in the following manner:

import numpy as np
import pandas as pd


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1
    return (df.rolling(ws, min_periods=0).apply(getattr(np, f'arg{func}'),
                                                raw=True)[wm1:].astype(int) +
            np.array([np.arange(len(df) - wm1)]).T).applymap(
                lambda x: df.index[x]).combine_first(df.applymap(lambda x: np.nan))


def main():
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0][3:6] = np.nan
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()

Result:

$ python demo.py 
      0   1   2
a   3.0   0   7
b   1.0   3  11
c   1.0  15  -6
d   NaN   2 -16
e   NaN   0  24
f   NaN   0  14
g   2.0   0   4
h  -1.0 -11  16
i  17.0   0  -2
j   3.0  -5  -8
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    b    a    c
d    d    d    d
e    d    e    d
f    d    e    d
g    e    e    g
h    f    h    g
i    h    h    i
j    h    h    j
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    c    b
d    d    c    b
e    d    c    e
f    d    d    e
g    e    e    e
h    f    f    h
i    i    g    h
j    i    i    h

The handling of NaN values is a little subtle. I want my rolling idxmin/max function to cooperate well with the regular DataFrame rolling min/max functions. By default, these produce NaN as soon as the window contains a NaN value, and so does the rolling apply function. For apply, that is a problem, because I cannot transform a NaN output into an index value. That is a pity: the NaN in the output exists only because there is a NaN in the input, so the index of that input NaN is what I would like my rolling idxmin/max function to produce. Fortunately, this is exactly what I get with the following combination of parameters:

min_periods=0 for the pandas rolling function. The apply function will then get a chance to produce its own value regardless of how many NaN values are found in the input window.
raw=True for the apply function. This parameter ensures that the input to the applied function is passed as a numpy array instead of a pandas Series. np.argmin/max will then return the index of the first input NaN value, which is exactly what we want. It should be noted that without raw=True, i.e. in the pandas Series case, np.argmin/max seems to ignore the NaN values, which is NOT what we want. The nice thing with raw=True is that it should improve performance too! More about that later.
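The NaN behaviour that `raw=True` relies on can be checked directly on a NumPy array. A minimal sketch with a made-up window; `np.nanargmin` is not used anywhere in this answer and is shown only as a contrast for the NaN-ignoring case:

```python
import numpy as np

# Made-up window with NaN at positions 1 and 3.
window = np.array([1.0, np.nan, -2.0, np.nan])

# On a raw array, NaN propagates: both functions return the position of the first NaN.
first_nan_min = int(np.argmin(window))
first_nan_max = int(np.argmax(window))

# The NaN-ignoring counterpart returns the position of the actual minimum.
ignoring_nan = int(np.nanargmin(window))

print(first_nan_min, first_nan_max, ignoring_nan)  # 1 1 2
```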

Measurements on this solution seem to indicate that the running time for calculating rolling `idxmin`/`idxmax` in this way is five times the running time necessary to calculate the corresponding rolling `min`/`max`. This is quite high, but not quite "orders of magnitude longer". I guess this is thanks to `raw=True`. I will use this for now, but if anyone has a better proposition, I'm all ears. — nilo, Jan 01 '21 at 22:57

score 0 · Answer 2 · answered Jan 02 '21 at 20:00

The solution in my previous answer manages to give proper index values for NaN input values, but I have realized that this is most probably not what a native pandas rolling idxmin/idxmax would do by default. By default, it would produce a NaN value if there is one or more NaN values in the window.

I came up with a variant of my solution, which does that:

import numpy as np
import pandas as pd


def transform_if_possible(func):
    def f(i):
        try:
            return func(i)
        except ValueError:
            return i
    return f


int_if_possible = transform_if_possible(int)


def data_frame_rolling_idx_func(df, window_size, func):
    ws = window_size
    wm1 = window_size - 1

    index_if_possible = transform_if_possible(lambda i: df.index[i])

    return (df.rolling(ws).apply(getattr(np, f'arg{func}'), raw=True).applymap(int_if_possible) +
            np.array([np.arange(len(df)) - wm1]).T).applymap(index_if_possible)


def main():
    print(int_if_possible(1.2))
    print(int_if_possible(np.nan))
    index = map(chr, range(ord('a'), ord('a') + 10))
    df = pd.DataFrame((10 * np.random.randn(10, 3)).astype(int), index=index)
    df[0][3:6] = np.nan
    print(df)
    print(data_frame_rolling_idx_func(df, 3, 'min'))
    print(data_frame_rolling_idx_func(df, 3, 'max'))


if __name__ == "__main__":
    main()

Results:

1
nan
      0   1   2
a  15.0  -2  13
b  -6.0  -4  -3
c -12.0  -7  -8
d   NaN   0  -4
e   NaN  -1 -11
f   NaN  -9  10
g  -1.0  24   1
h -15.0  14 -16
i   7.0  -4  14
j  -1.0   4  10
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    c    c    c
d  NaN    c    c
e  NaN    c    e
f  NaN    f    e
g  NaN    f    e
h  NaN    f    h
i    h    i    h
j    h    i    h
     0    1    2
a  NaN  NaN  NaN
b  NaN  NaN  NaN
c    a    a    a
d  NaN    d    b
e  NaN    d    d
f  NaN    d    f
g  NaN    g    f
h  NaN    g    f
i    i    g    i
j    i    h    i

To achieve my goal, I am using two functions to transform values into integers, and row numbers into index values, respectively, which leave NaN unchanged. I construct these functions with the help of a common closure, transform_if_possible. In the second case, since the index transformation is dependent on the DataFrame, I construct the transformation function from a local lambda function.

Apart from these aspects, the solution is similar to my previous one, but since NaN is explicitly handled, I no longer need special handling of the first window_size - 1 rows, so the code is a little shorter.

A nice side effect of this solution is that the running time seems to be lower: a little over three times the running time of the corresponding rolling min/max, instead of five times.

All in all, a better solution I think.

I just discovered the option `na_action='ignore'` in `DataFrame.applymap` in pandas 1.2, which I automatically get if I install my virtualenv with Python 3.9. This is great, since it makes my explicit `NaN` handling above redundant, and the resulting code's running time is divided by 2.5 or so! The code adaptation is left as an exercise, since I seem to be the only one interested in this issue anyway... — nilo, Jan 04 '21 at 17:15
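One possible shape of that adaptation is sketched below. It is a guess, not the commenter's actual code: the function name `rolling_idx_na_ignore` and the toy data are hypothetical. Since pandas 2.1 renamed `DataFrame.applymap` to `DataFrame.map`, the sketch uses whichever of the two is available.

```python
import numpy as np
import pandas as pd


def rolling_idx_na_ignore(df, window_size, func):
    """Hypothetical variant: let na_action='ignore' skip the NaN cells."""
    wm1 = window_size - 1
    # Position of the extremum inside each window; NaN for windows that are
    # incomplete or contain a NaN (default min_periods).
    positions = df.rolling(window_size).apply(getattr(np, f"arg{func}"), raw=True)
    # Shift by each window's first row number to get absolute row numbers.
    row_numbers = positions + np.array([np.arange(len(df)) - wm1]).T
    # DataFrame.applymap was renamed to DataFrame.map in pandas 2.1.
    elementwise = row_numbers.map if hasattr(pd.DataFrame, "map") else row_numbers.applymap
    # NaN cells are left untouched; the rest are floats holding whole row numbers.
    return elementwise(lambda x: df.index[int(x)], na_action="ignore")


# Toy data, chosen only for illustration.
df = pd.DataFrame(
    {"x": [3, 1, 2, 5, 4], "y": [1.0, np.nan, 0.0, 2.0, 3.0]},
    index=list("abcde"),
)
result = rolling_idx_na_ignore(df, 3, "min")
print(result)
```

With this input, column `x` gives `b`, `b`, `c` for rows `c` to `e`, while column `y` stays NaN until row `e`, the first window without a NaN.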

score 0 · Answer 3 · answered Nov 18 '21 at 03:46

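A monotonic deque can solve it in O(N):

from collections import deque


def get_rolling_idxmin(l_input: list, window_size: int) -> list:
    """Return the (index, value) pair of the minimum of each trailing window."""
    res = []
    # (index, value) pairs with strictly increasing values; the front is the
    # minimum of the current window. A deque makes popping the front O(1).
    deq = deque()
    for i, v in enumerate(l_input):
        # Drop the front element once it has fallen out of the window.
        if deq and (i - deq[0][0]) >= window_size:
            deq.popleft()
        # Drop elements that can never be the minimum again.
        while deq and v <= deq[-1][1]:
            deq.pop()
        deq.append((i, v))
        res.append(deq[0])
    return res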

l_min = get_rolling_idxmin(df.bp1[::-1].to_list(), 50)
df_min = pd.DataFrame(l_min, columns=['index_min', 'value_min'])
df_min['index_min'] = df_min.shape[0]-1-df_min.index_min
df_min = df_min[::-1]
df_min.reset_index(drop=True, inplace=True)
# print(df_min)
df = pd.concat([df,df_min], axis=1)
