
I have a pandas DataFrame and I want to compute, on a rolling basis, the average of all the values in the window: across all columns and all observations in the rolling window.

I have a solution with loops, but it feels very inefficient. Note that my data can contain NaNs, so summing and dividing by the size of the window would not be safe (I want a nanmean).

Any better approach?

Setup

import numpy as np
import pandas as pd

np.random.seed(1)

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])

df[df>5] = np.nan  # EDIT: add nans

My Attempt

n_roll = 2

df_stacked = df.values
roll_avg = {}
for idx in range(n_roll, len(df_stacked)+1):
    roll_avg[idx-1] = np.nanmean(df_stacked[idx - n_roll:idx, :].flatten())

roll_avg = pd.Series(roll_avg)
roll_avg.index = df.index[n_roll-1:]
roll_avg = roll_avg.reindex(df.index)

Desired Result

roll_avg
Out[33]: 
0         NaN
1    5.000000
2    1.666667
3    0.333333
4    1.000000
5    3.000000
6    3.250000
7    3.250000
8    3.333333
9    4.000000

Thanks!

mck
FLab
  • [using-rolling-on-multiple-columns](https://stackoverflow.com/questions/51615849/pandas-using-rolling-on-multiple-columns/51616256#51616256) may be of use. You can either use `.shift`, or stack and then roll over a bigger window. – ALollz Aug 15 '18 at 14:35

3 Answers


Here's one NumPy solution that builds sliding windows with view_as_windows from scikit-image -

from skimage.util.shape import view_as_windows

# Setup o/p array
out = np.full(len(df),np.nan)

# Get sliding windows of length n_roll along axis=0
w = view_as_windows(df.values,(n_roll,1))[...,0]

# Assign nan-ignored mean values computed along last 2 axes into o/p
out[n_roll-1:] = np.nanmean(w, (1,2))

Memory efficiency with views -

In [62]: np.shares_memory(df,w)
Out[62]: True
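
If scikit-image is not installed, a similar windowed view can probably be built with NumPy alone. The sketch below assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available; the function name `rolling_nanmean_all` is made up for illustration and is not part of the answer above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20


def rolling_nanmean_all(df, n_roll):
    """NaN-ignoring mean over all columns of each n_roll-row window (hypothetical helper)."""
    values = df.to_numpy(dtype=float)
    out = np.full(len(df), np.nan)
    if len(df) < n_roll:
        # Not enough rows for a single full window
        return pd.Series(out, index=df.index)
    # Windows along axis 0; the window axis is appended last,
    # giving shape (n_windows, n_cols, n_roll). This is a view of `values`.
    w = sliding_window_view(values, n_roll, axis=0)
    # Mean over both the column axis and the window axis, ignoring NaN
    out[n_roll - 1:] = np.nanmean(w, axis=(1, 2))
    return pd.Series(out, index=df.index)
```

Calling `rolling_nanmean_all(df, n_roll)` on the question's setup should give the same series as the loop. Windows that contain only NaN still trigger NumPy's "Mean of empty slice" warning and yield NaN.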
Divakar
  • Thanks a lot for the answer. Is there any way I can avoid using skimage? It does not seem to be included in the standard Anaconda distribution or in my current environment – FLab Aug 15 '18 at 15:15
  • @FLab You can use the [`source code`](https://github.com/scikit-image/scikit-image/blob/master/skimage/util/shape.py#L106). – Divakar Aug 15 '18 at 15:20
  • @FLab Or use [`strided_axis0`](https://stackoverflow.com/a/43413801/3293881) : `np.nanmean(strided_axis0(df.values, n_roll),(1,2))`. – Divakar Aug 15 '18 at 15:22
  • Thanks, it is working with strided_axis0. I just noticed that if all the values in the window are NaN, I get `RuntimeWarning: Mean of empty slice`, and this seems to affect time performance, so it is not as quick as the 'stack' solution I posted – FLab Aug 21 '18 at 07:55
  • @FLab Not sure how warnings can affect performance. How about turning off warnings at the top? – Divakar Aug 21 '18 at 07:58
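
One way to silence the warning only around the mean computation, rather than globally, is the standard-library `warnings.catch_warnings` context manager. This is a minimal sketch; the wrapper name `quiet_nanmean` is hypothetical.

```python
import warnings

import numpy as np


def quiet_nanmean(a, axis=None):
    """np.nanmean with RuntimeWarning suppressed inside this call only (hypothetical helper)."""
    with warnings.catch_warnings():
        # Ignore "Mean of empty slice" (a RuntimeWarning) for all-NaN slices;
        # the previous warning filters are restored when the block exits.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(a, axis=axis)
```

All-NaN slices still return NaN. Whether this changes the timing is not something the sketch establishes.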

To get the same result when the data contains NaN, you can apply np.column_stack to df.shift(i).values for i in range(n_roll), take np.nanmean along axis=1, and then replace the first n_roll-1 values with NaN:

roll_avg = pd.Series(np.nanmean(np.column_stack([df.shift(i).values for i in range(n_roll)]),1))
roll_avg[:n_roll-1] = np.nan

With the second input (the one containing NaN), you get the expected result:

0         NaN
1    5.000000
2    1.666667
3    0.333333
4    1.000000
5    3.000000
6    3.250000
7    3.250000
8    3.333333
9    4.000000
dtype: float64
Ben.T

Using the answer referenced in the comment, one can do:

wsize = n_roll
cols = df.shape[1]
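# Stack the frame into one long Series, keeping NaN so every row contributes `cols` entries
out = df.stack(dropna=False).rolling(window=wsize * cols, min_periods=1).mean().reset_index(-1, drop=True).sort_index()
# Keep one value per original row (the entry for the last column covers the full window)
out = out.groupby(out.index).last()
out.iloc[:n_roll-1] = np.nan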

In my case it was important to specify dropna=False in stack, otherwise the length of the rolling window would not be correct.

But I am looking forward to other approaches as this does not feel very elegant/efficient.
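
One alternative that stays within pandas, offered as a sketch rather than a benchmarked solution: divide the rolling sum of the non-NaN values by the rolling count of the non-NaN values. This avoids dividing by the window shape, which was the concern in the question. The function name `rolling_mean_all_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_mean_all_columns(df, n_roll):
    """Rolling NaN-ignoring mean over all columns (hypothetical helper)."""
    # Per-row sum of non-NaN values (an all-NaN row sums to 0) ...
    row_sum = df.sum(axis=1)
    # ... and per-row number of non-NaN values
    row_cnt = df.count(axis=1)
    # Totals over the last n_roll rows; the first n_roll-1 entries are NaN
    total = row_sum.rolling(n_roll).sum()
    count = row_cnt.rolling(n_roll).sum()
    # A window with no valid values gives 0/0, which pandas evaluates to NaN
    return total / count
```

Because the rolling sums are accumulated in floating point, the result may differ from the loop version in the last few decimal places.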

FLab