17

EDIT: This question was asked in 2016, and similar questions have been posted on SO years later, after the functionality was finally removed, e.g. module 'pandas' has no attribute 'rolling_mean'.

However, the question concerns the performance of the new .rolling(...).mean() API and should stay open until the associated pandas issue is fixed.


It looks like pd.rolling_mean is becoming deprecated for ndarrays,

 pd.rolling_mean(x, window=2, center=False)

FutureWarning: pd.rolling_mean is deprecated for ndarrays and will be removed in a future version

but it seems to be the fastest way of doing this, according to this SO answer.

Are there now new ways of doing this directly with SciPy or NumPy that are as fast as pd.rolling_mean?
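For concreteness, a minimal cumsum-based sketch of the kind of NumPy-only replacement being asked about (illustrative only, not benchmarked):

import numpy as np

def rolling_mean(x, window):
    # Trailing rolling mean of a 1-D ndarray, mimicking pd.rolling_mean:
    # the first window-1 entries are NaN and entry i is the mean of
    # x[i-window+1 : i+1].
    out = np.full(len(x), np.nan)
    c = np.cumsum(np.insert(np.asarray(x, dtype=float), 0, 0.0))
    out[window - 1:] = (c[window:] - c[:-window]) / window
    return out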

saladi
  • I still don't see an answer to the question "What is an alternative rolling_mean function for ndarrays?" This should be included in scipy or numpy without needing to rely on a Pandas function intended for use on Dataframes – Mike May 25 '16 at 19:42

5 Answers

10

EDIT -- Unfortunately, it looks like the new way is not nearly as fast:

New version of Pandas:

In [1]: x = np.random.uniform(size=100)

In [2]: %timeit pd.rolling_mean(x, window=2)
1000 loops, best of 3: 240 µs per loop

In [3]: %timeit pd.Series(x).rolling(window=2).mean()
1000 loops, best of 3: 226 µs per loop

In [4]: pd.__version__
Out[4]: '0.18.0'

Old version:

In [1]: x = np.random.uniform(size=100)

In [2]: %timeit pd.rolling_mean(x,window=2)
100000 loops, best of 3: 12.4 µs per loop

In [3]: pd.__version__
Out[3]: u'0.17.1'
saladi
  • good point and it looks like you're right. See my edit. I'm going to open the question up again to see if anyone else has a solution here that retains the older speed. – saladi Mar 29 '16 at 03:41
  • 1
    dang yeah that sucks ! – maxymoo Mar 29 '16 at 04:08
  • See here: this *should* only add a tiny bit of function call overhead, but this has an unnecessary copy of the internal blocks, easy fix: https://github.com/pydata/pandas/issues/12732 – Jeff Mar 29 '16 at 16:51
  • This is horrible syntax... we went from simple and terse, to something verbose and unpythonic. – Merlin May 26 '16 at 02:36
  • 2
    I almost agree - but the new syntax means that we can apply *any* function to that window, not just the precanned ones. – Contango Jul 31 '17 at 02:44
  • How can I use pd.Series(x) for a 3D array? Here x is a 3D numpy array. – Prvt_Yadav Mar 26 '19 at 04:55
5

Looks like the new way is via methods on the Rolling object returned by DataFrame.rolling() / Series.rolling() (I guess you're meant to think of it sort of like a groupby): http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html

e.g.

x.rolling(window=2).mean()
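Since the question's x is an ndarray, it has to be wrapped in a Series (or DataFrame) first to get the .rolling() accessor; a small sketch of that, plus the groupby-like flexibility of applying an arbitrary function to each window (the window sizes and the callable here are arbitrary choices):

import numpy as np
import pandas as pd

x = np.random.uniform(size=100)

# Wrap the ndarray in a Series, then pull the result back out as an ndarray.
means = pd.Series(x).rolling(window=2).mean().values

# Unlike pd.rolling_mean, the rolling object accepts arbitrary reductions,
# e.g. the peak-to-peak range of each 5-element window.
ptp = pd.Series(x).rolling(window=5).apply(lambda w: w.max() - w.min())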
maxymoo
  • Yeah, I realized that. Should've included this in the question. In any case, it turns out it's just as fast even though it requires explicitly turning `x` into a `pd.Series` first (See my answer with details). – saladi Mar 29 '16 at 03:17
  • How can I use pd.Series(x) for a 3D array? Here x is a 3D numpy array. – Prvt_Yadav Mar 26 '19 at 04:56
1

try this

x.rolling(window=2, center=False).mean()
Pruce Uchiha
0

I suggest scipy.ndimage.filters.uniform_filter1d, as in my answer to the linked question. It is also way faster for large arrays:

import numpy as np
import pandas as pd
from scipy.ndimage.filters import uniform_filter1d
N = 1000
x = np.random.random(100000)

%timeit pd.rolling_mean(x, window=N)
__main__:257: FutureWarning: pd.rolling_mean is deprecated for ndarrays and will be removed in a future version
The slowest run took 84.55 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 7.37 ms per loop

%timeit uniform_filter1d(x, size=N)
10000 loops, best of 3: 190 µs per loop
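Note that the two timed calls are not element-for-element identical: uniform_filter1d centers its window and pads the edges (mode='reflect' by default), while pd.rolling_mean uses a trailing window and returns NaN for the first N-1 points. A sketch of a like-for-like check, assuming an odd window and a centered pandas rolling mean:

import numpy as np
import pandas as pd
from scipy.ndimage import uniform_filter1d  # non-deprecated import path

x = np.random.random(100000)
N = 1001  # odd, so "centered" means the same thing in both libraries

sp = uniform_filter1d(x, size=N)
pdm = pd.Series(x).rolling(N, center=True).mean().values

# Edge handling differs (reflection padding vs. NaN), so compare only the
# interior positions where the full window fits inside the array.
interior = slice(N // 2, -(N // 2))
assert np.allclose(sp[interior], pdm[interior])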
moi
-2

If your dimensions are homogeneous, you could try to implement an n-dimensional form of the Summed Area Table used for bidimensional images:

A summed area table is a data structure and algorithm for quickly and efficiently generating the sum of values in a rectangular subset of a grid.

Then, in this order, you could:

  1. Create the summed area table ("integral") of your array;
  2. Iterate to get the (quite cheap) sum of an n-dimensional kernel at a given position;
  3. Divide by the size of the n-dimensional volume of the kernel.

Unfortunately I cannot say whether this is efficient or not, but by the given premise, it should be.
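A hedged 2-D sketch of the idea (the function name and the "full windows only" edge handling are illustrative choices):

import numpy as np

def rolling_mean_2d(a, k):
    # Mean over every k x k window of a 2-D array via a summed-area table.
    # Only windows that fit entirely inside `a` are returned, so the output
    # has shape (H - k + 1, W - k + 1).
    #
    # Step 1: summed-area table with a zero row/column prepended, so that
    # S[i, j] == a[:i, :j].sum().
    S = np.zeros((a.shape[0] + 1, a.shape[1] + 1), dtype=float)
    S[1:, 1:] = np.cumsum(np.cumsum(a, axis=0), axis=1)
    # Step 2: each window sum is a cheap four-corner combination
    # (inclusion-exclusion) of table entries ...
    window_sums = S[k:, k:] - S[:-k, k:] - S[k:, :-k] + S[:-k, :-k]
    # Step 3: ... divided by the window's area.
    return window_sums / (k * k)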

heltonbiker