226

There seems to be no function that simply calculates the moving average in numpy/scipy, leading to convoluted solutions.

My question is two-fold:

  • What's the easiest way to (correctly) implement a moving average with numpy?
  • Since this seems non-trivial and error-prone, is there a good reason not to have the batteries included in this case?
loopbackbee
  • Possible duplicate of [Moving average or running mean](https://stackoverflow.com/questions/13728392/moving-average-or-running-mean) – pppery Mar 05 '20 at 23:15

19 Answers

256

If you just want a straightforward non-weighted moving average, you can easily implement it with np.cumsum, which is probably faster than FFT-based methods:

EDIT: Corrected an off-by-one indexing error spotted by Bean in the code.

import numpy as np

def moving_average(a, n=3):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
    
>>> a = np.arange(20)
>>> moving_average(a)
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
        12.,  13.,  14.,  15.,  16.,  17.,  18.])
>>> moving_average(a, n=4)
array([  1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,   9.5,
        10.5,  11.5,  12.5,  13.5,  14.5,  15.5,  16.5,  17.5])

So I guess the answer is: it is really easy to implement, and maybe numpy is already a little bloated with specialized functionality.

Jaime
  • This is an interesting way of doing it, but honestly I wouldn't be able to come up with it. Whether it belongs in numpy is debatable, but certainly scipy could include it? – loopbackbee Jan 14 '13 at 20:10
  • Thanks for the bug checking, it now seems to be working fine. As for the upvotes, I am guessing the general idea behind the answer was weighed more heavily than an off-by-one error in the implementation, but who knows. – Jaime Aug 22 '13 at 19:33
  • I have found the problem. `ret[n:] -= ret[:-n]` is NOT THE SAME as `ret[n:] = ret[n:] - ret[:-n]`. I have fixed the code in this answer. Edit: Nope somebody else just beat me to it. – Timmmm Dec 04 '13 at 14:29
  • @Timmmm I did, that was indeed the issue. The general principle behind this answer is widely used in image processing (summed area tables they call it), so the issue had to be in the implementation. A good example of being bitten by premature optimization, since I kind of recall doing the operation in-place "because it will be more efficient." On the bright side, it probably did produce the wrong answer faster... – Jaime Dec 04 '13 at 14:39
  • @askewchan In Timmmm's example, when doing `ret[2:] -= ret[:-2]`, `ret` is `[1, 2, 3, 4, 5, 6]`. When it processes the last data point it is basically doing `ret[5] -= ret[3]`. We would like to get `ret[5] = 6 - 4`, but we get `ret[5] = 6 - 2` instead, because `ret[3]` has already been modified by the in-place operation to be `ret[3] -= ret[1]`, i.e. `2`. If you don't do the operation in-place it creates a temporary array and then copies it into `ret`, avoiding this problem. – Jaime Dec 04 '13 at 16:31
  • Of course, that makes sense. So `ret[:-2] -= ret[2:]` will behave the same as `a[:-2] = a[:-2] - a[2:]` since it looks to the _right_. – askewchan Dec 04 '13 at 16:35
  • Well, there's the issue of how numpy will access memory, which I am afraid is an implementation detail. It seems to make sense that memory is scanned linearly in increasing address order, but I am not sure there is a guarantee on it not changing in the future. For numpy 1.8 it is true, even if you take a view with negative strides, i.e. if you do `ret = ret[::-1]` before doing the in-place subtraction, it is still `ret[:-2] -= ret[2:]` that gives the correct result, but not `ret[2:] -= ret[:-2]`, which means numpy is accessing memory in decreasing address order... – Jaime Dec 04 '13 at 16:46
  • I agree that a np.convolve will be much faster. – Yuxiang Wang Apr 30 '14 at 00:12
  • No, it is not faster, maybe it is more convenient, but it is slower, much slower if the window is large. – Jaime Apr 30 '14 at 02:11
  • What about `np.correlate(a, [0.25, 0.25, 0.25, 0.25])`, this also extends to a weighted moving average. `np.correlate(a, [0.1, 0.2, 0.3, 0.4])` – Akavall Apr 30 '14 at 02:33
  • @Akavall If you need a weighted moving average, then yes, probably `correlate` is the way to go. But because `correlate` uses a generic algorithm, if your window is of size `m`, at every point it must perform `m` multiplications and `m-1` additions. The method here, or what `scipy.filter.uniform` uses, only needs a multiplication, an addition and a subtraction. – Jaime Apr 30 '14 at 12:39
  • just a quick practical question: which end of my time vector (time in seconds at each index) am I wanting to shrink after applying this to a signal vector? n=3 removes 2 entries, n=4 removes 3, etc. my intuition is that i should remove the time from the end (right side) of the vector. – user391339 Sep 16 '14 at 00:41
  • @user391339 It depends on what you are after. If you remove times from the end, your entries will be averaging into the future. If you remove times from the beginning, they will be an average of past values. And you can even remove half and half, and have a centered average. – Jaime Sep 16 '14 at 07:33
  • Hmmm, it seems this "easy to implement" function is actually pretty easy to get wrong and has fostered a good discussion on memory efficiency. I'm happy to have bloat if it means knowing that something's been done right. – Richard Sep 20 '14 at 19:23
  • @Jaime. Am I missing something, or do you need a line to the effect of `ret[:n] /= np.arange(n)`? (I am pretty sure `/=` is OK here, unlike the `-=` before) – Mad Physicist Jun 13 '17 at 18:33
  • ValueError: the 'dtype' parameter is not supported in the pandas implementation of cumsum() – JavaRunner Sep 20 '17 at 20:16
  • Seems `np.cumsum` solution is less numerically stable than `np.convolve` solution. – mrgloom Nov 02 '18 at 15:45
  • For me in python3 `n = 3; x = np.arange(5); x[n:] -= x[:-n]; print(x)` and `n = 3; x = np.arange(5); x[n:] = x[n:] - x[:-n]; print(x)` gives the same result `[0 1 2 3 3]`, does it depends on numpy version? – mrgloom Nov 03 '18 at 19:27
  • This answer corresponds to `pd.Series(np.arange(20)).rolling(window=4, center=True).mean().dropna()`. More useful would be to have the left and right sides remain to not get shrunken data e.g. `pd.Series(np.arange(20)).rolling(window=4, center=True, min_periods=1).mean()`. The left and right-sided variations are also interesting. Would be nice to see the generalized form using numpy. Also note this version is subject to integer/floating pointer overflows and is for that reason not necessarily ideal despite being very efficient when that does not happen. – Gregory Morse Dec 04 '19 at 06:09
  • All, This is not working right. Just run a test moving average for yourself. The resulting vector is shifted left by n/2. – Valentyn May 01 '20 at 11:54
  • I think the return statement should be `return np.concatenate((np.zeros(n-1), ret[n - 1:] / n))` so the MA is aligned with original data. – Ali Pardhan Jul 16 '20 at 23:00
  • Could/Should this answer be updated to include [`numpy.lib.stride_tricks.sliding_window_view`](https://numpy.org/devdocs/reference/generated/numpy.lib.stride_tricks.sliding_window_view.html) that was introduced in numpy v1.20? The solution becomes a one-liner: `sliding_window_view(array, window_size).mean(axis=-1)`. Speed might be comparable to `cumsum`, but I am not 100% sure. – FirefoxMetzger Oct 23 '21 at 09:52
241

A simple way to achieve this is by using np.convolve. The idea behind this is to leverage the way the discrete convolution is computed and use it to return a rolling mean. This can be done by convolving with a sequence of np.ones of a length equal to the sliding window length we want.

In order to do so we could define the following function:

def moving_average(x, w):
    return np.convolve(x, np.ones(w), 'valid') / w

This function takes the convolution of the sequence x with a sequence of ones of length w. Note that the chosen mode is valid, so that the convolution product is only given for points where the sequences overlap completely.


Some examples:

x = np.array([5,3,8,10,2,1,5,1,0,2])

For a moving average with a window of length 2 we would have:

moving_average(x, 2)
# array([4. , 5.5, 9. , 6. , 1.5, 3. , 3. , 0.5, 1. ])

And for a window of length 4:

moving_average(x, 4)
# array([6.5 , 5.75, 5.25, 4.5 , 2.25, 1.75, 2.  ])

How does convolve work?

Let's take a more in-depth look at how the discrete convolution is computed. The following function aims to replicate the way np.convolve is computing the output values:

def mov_avg(x, w):
    for m in range(len(x)-(w-1)):
        yield sum(np.ones(w) * x[m:m+w]) / w 

Which, for the same example above would also yield:

list(mov_avg(x, 2))
# [4.0, 5.5, 9.0, 6.0, 1.5, 3.0, 3.0, 0.5, 1.0]

So what is being done at each step is to take the inner product between the array of ones and the current window. In this case the multiplication by np.ones(w) is superfluous, given that we are directly taking the sum of the sequence (a simplified version is sketched after the worked example below).

Below is an example of how the first outputs are computed, so that it is a little clearer. Let's suppose we want a window of w=4:

[1,1,1,1]
[5,3,8,10,2,1,5,1,0,2]
= (1*5 + 1*3 + 1*8 + 1*10) / w = 6.5

And the following output would be computed as:

  [1,1,1,1]
[5,3,8,10,2,1,5,1,0,2]
= (1*3 + 1*8 + 1*10 + 1*2) / w = 5.75

And so on, returning a moving average of the sequence once all overlaps have been performed.
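Since the multiplication by the ones vector is superfluous, the generator can be written as a plain sliding sum. This is a minimal sketch of that simplification (my addition, not part of the original answer):

def mov_avg_simplified(x, w):
    # same output as mov_avg above, without the redundant np.ones(w) product
    for m in range(len(x) - (w - 1)):
        yield sum(x[m:m+w]) / w

For the same x, list(mov_avg_simplified(x, 2)) gives the same [4.0, 5.5, 9.0, 6.0, 1.5, 3.0, 3.0, 0.5, 1.0] as before.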

yatu
  • This is a nice idea! It is faster than @Jaime's answer for small n, but becomes slower for larger n. – Felipe Gerard May 07 '19 at 14:06
  • Sometimes it's useful to have output array the same size as the input. For this the `mode='valid'` can be replaced with `'same'`. Just in this case edge points will gravitate towards zero. – Ilia Barahovsky Mar 01 '20 at 20:31
  • In a situation where some elements of the array 'x' of the function may be None or zero, how do you get the corresponding 'x' values of the returned values from this function? The size of the returned array from this function can be smaller than array 'x' supplied to it. – Sun Bear Aug 24 '20 at 16:45
  • Should it be clarified that this method would yield the "centered moving average" instead of the "simple moving average" often calculated for financial applications? https://en.wikipedia.org/wiki/Moving_average – CraigDavid May 04 '21 at 02:43
  • Actually you are just doing average pooling with stride=1, not an actual convolution. – Anonymous Jun 08 '22 at 23:45
87

NumPy's lack of a particular domain-specific function is perhaps due to the Core Team's discipline and fidelity to NumPy's prime directive: provide an N-dimensional array type, as well as functions for creating and indexing those arrays. Like many foundational objectives, this one is not small, and NumPy does it brilliantly.

The (much) larger SciPy contains a far more extensive collection of domain-specific libraries (called subpackages by SciPy devs)--for instance, numerical optimization (optimize), signal processing (signal), and integral calculus (integrate).

My guess is that the function you are after is in at least one of the SciPy subpackages (scipy.signal perhaps); however, I would look first in the collection of SciPy scikits: identify the relevant scikit(s) and look for the function of interest there.

Scikits are independently developed packages based on NumPy/SciPy and directed to a particular technical discipline (e.g., scikits-image, scikits-learn, etc.). Several of these (in particular, the awesome OpenOpt for numerical optimization) were highly regarded, mature projects long before choosing to reside under the relatively new scikits rubric. The Scikits homepage lists about 30 such scikits, though at least several of those are no longer under active development.

Following this advice would lead you to scikits-timeseries; however, that package is no longer under active development. In effect, Pandas has become, AFAIK, the de facto NumPy-based time series library.

Pandas has several functions that can be used to calculate a moving average; the simplest of these is probably rolling_mean, which you use like so:

>>> # the recommended syntax to import pandas
>>> import pandas as PD
>>> import numpy as NP

>>> # prepare some fake data:
>>> # the date-time indices:
>>> t = PD.date_range('1/1/2010', '12/31/2012', freq='D')

>>> # the data:
>>> x = NP.arange(0, t.shape[0])

>>> # combine the data & index into a Pandas 'Series' object
>>> D = PD.Series(x, t)

Now, just call the function rolling_mean passing in the Series object and a window size, which in my example below is 10 days.

>>> d_mva = PD.rolling_mean(D, 10)

>>> # d_mva is the same size as the original Series
>>> d_mva.shape
    (1096,)

>>> # though obviously the first w values are NaN where w is the window size
>>> d_mva[:3]
    2010-01-01         NaN
    2010-01-02         NaN
    2010-01-03         NaN

Verify that it worked--e.g., compare values 10-15 in the original series versus the new Series smoothed with the rolling mean:

>>> D[10:15]
     2010-01-11    2.041076
     2010-01-12    2.041076
     2010-01-13    2.720585
     2010-01-14    2.720585
     2010-01-15    3.656987
     Freq: D

>>> d_mva[10:15]
      2010-01-11    3.131125
      2010-01-12    3.035232
      2010-01-13    2.923144
      2010-01-14    2.811055
      2010-01-15    2.785824
      Freq: D

The function rolling_mean, along with about a dozen or so other functions, is informally grouped in the Pandas documentation under the rubric moving window functions; a second, related group of functions in Pandas is referred to as exponentially-weighted functions (e.g., ewma, which calculates an exponentially weighted moving average). The fact that this second group is not included in the first (moving window functions) is perhaps because the exponentially-weighted transforms don't rely on a fixed-length window.
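Note (my addition, not part of the original answer): rolling_mean was later removed from pandas entirely; see the answer below that uses the rolling method. The modern equivalent of the call above would be something like:

>>> d_mva = D.rolling(window=10).mean()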

doug
  • Pandas does have a strong line up of moving window functions. But it seems to me like a little too much overhead for a simple moving average. – Jaime Jan 14 '13 at 07:31
  • well i doubt that calculating a moving average is an isolated requirement for the OP or for just about anyone else. If you need to calculate a moving average then you almost certainly have a time series, which means you need a data structure that allows you to conform a date-time index to your data and that's the 'overhead' you refer to. – doug Jan 14 '13 at 07:45
  • Firstly, thank you for taking the time to write this extremely informative answer. Indeed, I can't see a use for a moving average that doesn't involve a time series. But that doesn't mean that one needs to conform it to a datetime, or even to a particular sampling interval (it may be unknown) – loopbackbee Jan 14 '13 at 20:02
  • I agree. There could be a scikit blob that expects you to provide a properly built time-series before you feed it data. – RobotHumans Jan 25 '14 at 16:29
  • @goncalopp Where are you getting the s in D = PD.Series(s, t) from? I'm getting a NameError that it's not defined – Force444 Mar 01 '14 at 21:25
  • It should be x instead I think – Force444 Mar 01 '14 at 21:30
  • Just wanted to add that the moving average function has been extracted into the [Bottleneck](http://berkeleyanalytics.com/bottleneck/) library if pandas seems too heavyweight as a dependency. – robochat Jul 16 '14 at 12:38
  • 'rolling_mean' is not part of pandas anymore, please see the reply using 'rolling' instead – Vladtn Dec 16 '18 at 17:28
  • "the recommended syntax to import pandas" citation needed. I've literally never seen it in upper case (nor "NP"). – Trang Oul Aug 30 '21 at 05:20
52

Here are a variety of ways to do this, along with some benchmarks. The best methods are versions using optimized code from other libraries. The bottleneck.move_mean method is probably best all around. The scipy.convolve approach is also very fast, extensible, and syntactically and conceptually simple, but doesn't scale well for very large window values. The numpy.cumsum method is good if you need a pure numpy approach.

Note: Some of these (e.g. bottleneck.move_mean) are not centered, and will shift your data.

import numpy as np
import scipy as sci
import scipy.signal as sig
import pandas as pd
import bottleneck as bn
import time as time

def rollavg_direct(a,n): 
    'Direct "for" loop'
    assert n%2==1
    b = a*0.0
    for i in range(len(a)) :
        b[i]=a[max(i-n//2,0):min(i+n//2+1,len(a))].mean()
    return b

def rollavg_comprehension(a,n):
    'List comprehension'
    assert n%2==1
    r,N = int(n/2),len(a)
    return np.array([a[max(i-r,0):min(i+r+1,N)].mean() for i in range(N)]) 

def rollavg_convolve(a,n):
    'scipy.convolve'
    assert n%2==1
    return sci.convolve(a,np.ones(n,dtype='float')/n, 'same')[n//2:-n//2+1]  

def rollavg_convolve_edges(a,n):
    'scipy.convolve, edge handling'
    assert n%2==1
    return sci.convolve(a,np.ones(n,dtype='float'), 'same')/sci.convolve(np.ones(len(a)),np.ones(n), 'same')  

def rollavg_cumsum(a,n):
    'numpy.cumsum'
    assert n%2==1
    cumsum_vec = np.cumsum(np.insert(a, 0, 0)) 
    return (cumsum_vec[n:] - cumsum_vec[:-n]) / n

def rollavg_cumsum_edges(a,n):
    'numpy.cumsum, edge handling'
    assert n%2==1
    N = len(a)
    cumsum_vec = np.cumsum(np.insert(np.pad(a,(n-1,n-1),'constant'), 0, 0)) 
    d = np.hstack((np.arange(n//2+1,n),np.ones(N-n)*n,np.arange(n,n//2,-1)))  
    return (cumsum_vec[n+n//2:-n//2+1] - cumsum_vec[n//2:-n-n//2]) / d

def rollavg_roll(a,n):
    'Numpy array rolling'
    assert n%2==1
    N = len(a)
    rolling_idx = np.mod((N-1)*np.arange(n)[:,None] + np.arange(N), N)
    return a[rolling_idx].mean(axis=0)[n-1:] 

def rollavg_roll_edges(a,n):
    # see https://stackoverflow.com/questions/42101082/fast-numpy-roll
    'Numpy array rolling, edge handling'
    assert n%2==1
    a = np.pad(a,(0,n-1-n//2), 'constant')*np.ones(n)[:,None]
    m = a.shape[1]
    idx = np.mod((m-1)*np.arange(n)[:,None] + np.arange(m), m) # Rolling index
    out = a[np.arange(-n//2,n//2)[:,None], idx]
    d = np.hstack((np.arange(1,n),np.ones(m-2*n+1+n//2)*n,np.arange(n,n//2,-1)))
    return (out.sum(axis=0)/d)[n//2:]

def rollavg_pandas(a,n):
    'Pandas rolling average'
    return pd.DataFrame(a).rolling(n, center=True, min_periods=1).mean().to_numpy()

def rollavg_bottlneck(a,n):
    'bottleneck.move_mean'
    return bn.move_mean(a, window=n, min_count=1)

N = 10**6
a = np.random.rand(N)
functions = [rollavg_direct, rollavg_comprehension, rollavg_convolve, 
        rollavg_convolve_edges, rollavg_cumsum, rollavg_cumsum_edges, 
        rollavg_pandas, rollavg_bottlneck, rollavg_roll, rollavg_roll_edges]

print('Small window (n=3)')
%load_ext memory_profiler
for f in functions : 
    print('\n'+f.__doc__+ ' : ')
    %timeit b=f(a,3)

print('\nLarge window (n=1001)')
for f in functions[0:-2] : 
    print('\n'+f.__doc__+ ' : ')
    %timeit b=f(a,1001)

print('\nMemory\n')
print('Small window (n=3)')
N = 10**7
a = np.random.rand(N)
%load_ext memory_profiler
for f in functions[2:] : 
    print('\n'+f.__doc__+ ' : ')
    %memit b=f(a,3)

print('\nLarge window (n=1001)')
for f in functions[2:-2] : 
    print('\n'+f.__doc__+ ' : ')
    %memit b=f(a,1001)

Timing, Small window (n=3)

Direct "for" loop : 

4.14 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension : 
3.96 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.convolve : 
1.07 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

scipy.convolve, edge handling : 
4.68 ms ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum : 
5.31 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum, edge handling : 
8.52 ms ± 11.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas rolling average : 
9.85 ms ± 9.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bottleneck.move_mean : 
1.3 ms ± 12.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Numpy array rolling : 
31.3 ms ± 91.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy array rolling, edge handling : 
61.1 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing, Large window (n=1001)

Direct "for" loop : 
4.67 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

List comprehension : 
4.46 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

scipy.convolve : 
103 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

scipy.convolve, edge handling : 
272 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

numpy.cumsum : 
5.19 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

numpy.cumsum, edge handling : 
8.7 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Pandas rolling average : 
9.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

bottleneck.move_mean : 
1.31 ms ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Memory, Small window (n=3)

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler

scipy.convolve : 
peak memory: 362.66 MiB, increment: 73.61 MiB

scipy.convolve, edge handling : 
peak memory: 510.24 MiB, increment: 221.19 MiB

numpy.cumsum : 
peak memory: 441.81 MiB, increment: 152.76 MiB

numpy.cumsum, edge handling : 
peak memory: 518.14 MiB, increment: 228.84 MiB

Pandas rolling average : 
peak memory: 449.34 MiB, increment: 160.02 MiB

bottleneck.move_mean : 
peak memory: 374.17 MiB, increment: 75.54 MiB

Numpy array rolling : 
peak memory: 661.29 MiB, increment: 362.65 MiB

Numpy array rolling, edge handling : 
peak memory: 1111.25 MiB, increment: 812.61 MiB

Memory, Large window (n=1001)

scipy.convolve : 
peak memory: 370.62 MiB, increment: 71.83 MiB

scipy.convolve, edge handling : 
peak memory: 521.98 MiB, increment: 223.18 MiB

numpy.cumsum : 
peak memory: 451.32 MiB, increment: 152.52 MiB

numpy.cumsum, edge handling : 
peak memory: 527.51 MiB, increment: 228.71 MiB

Pandas rolling average : 
peak memory: 451.25 MiB, increment: 152.50 MiB

bottleneck.move_mean : 
peak memory: 374.64 MiB, increment: 75.85 MiB
argentum2f
32

Starting in Numpy 1.20, sliding_window_view provides a way to slide/roll through windows of elements, which you can then individually average.

For instance, for a 4-element window:

from numpy.lib.stride_tricks import sliding_window_view

# values = np.array([5, 3, 8, 10, 2, 1, 5, 1, 0, 2])
np.average(sliding_window_view(values, window_shape = 4), axis=1)
# array([6.5, 5.75, 5.25, 4.5, 2.25, 1.75, 2])

Note the intermediary result of sliding_window_view:

# values = np.array([5, 3, 8, 10, 2, 1, 5, 1, 0, 2])
sliding_window_view(values, window_shape = 4)
# array([[ 5,  3,  8, 10],
#        [ 3,  8, 10,  2],
#        [ 8, 10,  2,  1],
#        [10,  2,  1,  5],
#        [ 2,  1,  5,  1],
#        [ 1,  5,  1,  0],
#        [ 5,  1,  0,  2]])
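As a quick sanity check (my addition, not part of the original answer), the sliding-window mean matches the np.convolve approach from the earlier answer:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = np.array([5, 3, 8, 10, 2, 1, 5, 1, 0, 2])
windowed = np.average(sliding_window_view(values, window_shape=4), axis=1)
convolved = np.convolve(values, np.ones(4), 'valid') / 4
assert np.allclose(windowed, convolved)  # both give [6.5, 5.75, 5.25, 4.5, 2.25, 1.75, 2.]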
Xavier Guihot
  • Note that while it is simple, this method is significantly slower than the other methods listed here - in my tests it came out around 80 times slower than bottleneck.move_mean, 15 times slower than numpy.cumsum – Voy Aug 05 '22 at 09:10
11

This answer using Pandas is adapted from the one above, as rolling_mean is no longer part of Pandas.

# the recommended syntax to import pandas
import pandas as pd
import numpy as np

# prepare some fake data:
# the date-time indices:
t = pd.date_range('1/1/2010', '12/31/2012', freq='D')

# the data:
x = np.arange(0, t.shape[0])

# combine the data & index into a Pandas 'Series' object
D = pd.Series(x, t)

Now, just call the function rolling on the Series with a window size, which in my example below is 10 days.

d_mva10 = D.rolling(10).mean()

# d_mva is the same size as the original Series
# though obviously the first w values are NaN where w is the window size
d_mva10[:11]

2010-01-01    NaN
2010-01-02    NaN
2010-01-03    NaN
2010-01-04    NaN
2010-01-05    NaN
2010-01-06    NaN
2010-01-07    NaN
2010-01-08    NaN
2010-01-09    NaN
2010-01-10    4.5
2010-01-11    5.5
Freq: D, dtype: float64
Vladtn
9

I feel this can be easily solved using bottleneck.

See basic sample below:

import numpy as np
import bottleneck as bn

a = np.random.randint(4, 1000, size=(5, 7))
mm = bn.move_mean(a, window=2, min_count=1)

This gives the moving mean along each row (the last axis, which is the default).

  • "mm" is the moving mean for "a".

  • "window" is the max number of entries to consider for moving mean.

  • "min_count" is min number of entries to consider for moving mean (e.g. for first element or if the array has nan values).

The good part is that Bottleneck helps to deal with nan values and it's also very efficient.
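For a quick 1D illustration (my sketch, not the answer author's): with min_count=1 the leading entries are averaged over however many values are available, so the result is a lagged (trailing) mean rather than a centered one:

import numpy as np
import bottleneck as bn

bn.move_mean(np.arange(5, dtype=float), window=3, min_count=1)
# array([0. , 0.5, 1. , 2. , 3. ])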

Anthony Anyanwu
  • Bottleneck is a nice quick and easy solution to implement (upvote), but good to emphasize that it only offers "lagged" running means, not centered, as also pointed out by argentum2f – ClimateUnboxed Sep 29 '21 at 07:28
7

In case someone needs a simple solution, here's one:

def moving_average(a,n):
    N=len(a)
    return np.array([np.mean(a[i:i+n]) for i in np.arange(0,N-n+1)])

You can change the overlap between windows by adding a step argument, as in np.arange(0,N-n+1,step); a sketch of this variant follows.
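A hypothetical stepped variant (the function name and the step parameter are my additions, not the answer author's):

import numpy as np

def moving_average_strided(a, n, step=1):
    # mean over windows of length n, starting every `step` elements
    N = len(a)
    return np.array([np.mean(a[i:i+n]) for i in np.arange(0, N-n+1, step)])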

Kaoutar
3

In case you want to handle the edge conditions carefully (computing the mean only from the available elements at the edges), the following function will do the trick.

import numpy as np

def running_mean(x, N):
    out = np.zeros_like(x, dtype=np.float64)
    dim_len = x.shape[0]
    for i in range(dim_len):
        if N%2 == 0:
            a, b = i - (N-1)//2, i + (N-1)//2 + 2
        else:
            a, b = i - (N-1)//2, i + (N-1)//2 + 1

        #cap indices to min and max indices
        a = max(0, a)
        b = min(dim_len, b)
        out[i] = np.mean(x[a:b])
    return out

>>> running_mean(np.array([1,2,3,4]), 2)
array([1.5, 2.5, 3.5, 4. ])

>>> running_mean(np.array([1,2,3,4]), 3)
array([1.5, 2. , 3. , 3.5])
Peixiang Zhong
3

All the answers seem to focus on the case of a precomputed array. For the actual running use case, where the numbers come in one by one, here is a simple class that provides the service of averaging the last N values:

import numpy as np
class RunningAverage():
    def __init__(self, stack_size):
        self.stack = [0 for _ in range(stack_size)]
        self.ptr = 0
        self.full_cycle = False
    def add(self,value):
        self.stack[self.ptr] = value
        self.ptr += 1
        if self.ptr == len(self.stack):
            self.full_cycle = True
            self.ptr = 0
    def get_avg(self):
        if self.full_cycle:
            return np.mean(self.stack)
        else:
            return np.mean(self.stack[:self.ptr])

usage:

N = 50  # size of the averaging window
run_avg = RunningAverage(N)
for i in range(1000):
    value = <my computation>
    run_avg.add(value)
    if i % 20 ==0: # print once in 20 iters:
        print(f'the average value is {run_avg.get_avg()}')
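As an alternative sketch (my addition, not the answer author's): keeping a running sum makes get_avg O(1) instead of recomputing np.mean over the whole window on every call:

class RunningAverageO1:
    # same interface as RunningAverage; maintains the window sum incrementally
    # (note: very long runs can accumulate floating-point error)
    def __init__(self, stack_size):
        self.stack = [0.0] * stack_size
        self.ptr = 0
        self.count = 0
        self.total = 0.0
    def add(self, value):
        self.total += value - self.stack[self.ptr]  # drop the oldest value from the sum
        self.stack[self.ptr] = value
        self.ptr = (self.ptr + 1) % len(self.stack)
        self.count = min(self.count + 1, len(self.stack))
    def get_avg(self):
        return self.total / max(self.count, 1)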
2

You can also write your own Python C extension.

This is certainly not the easiest way, but will allow you to go faster and be more memory efficient than using np.cumsum as a building block.

// moving_average.c
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <Python.h>
#include <numpy/arrayobject.h>

static PyObject *moving_average(PyObject *self, PyObject *args) {
    PyObject *input;
    int64_t window_size;
    PyArg_ParseTuple(args, "Ol", &input, &window_size);
    if (PyErr_Occurred()) return NULL;
    if (!PyArray_Check(input) || !PyArray_ISNUMBER((PyArrayObject *)input)) {
        PyErr_SetString(PyExc_TypeError, "First argument must be a numpy array with numeric dtype");
        return NULL;
    }
    
    int64_t input_size = PyObject_Size(input);
    double *input_data;
    if (PyArray_AsCArray(&input, &input_data, (npy_intp[]){ [0] = input_size }, 1, PyArray_DescrFromType(NPY_DOUBLE)) != 0) {
        PyErr_SetString(PyExc_TypeError, "Failed to simulate C array of type double");
        return NULL;
    }
    
    int64_t output_size = input_size - window_size + 1;
    PyObject *output = PyArray_SimpleNew(1, (npy_intp[]){ [0] = output_size }, NPY_DOUBLE);
    double *output_data = PyArray_DATA((PyArrayObject *)output);
    
    double cumsum_before = 0;
    double cumsum_after = 0;
    for (int i = 0; i < window_size; ++i) {
        cumsum_after += input_data[i];
    }
    for (int i = 0; i < output_size - 1; ++i) {
        output_data[i] = (cumsum_after - cumsum_before) / window_size;
        cumsum_after += input_data[i + window_size];
        cumsum_before += input_data[i];
    }
    output_data[output_size - 1] = (cumsum_after - cumsum_before) / window_size;

    return output;
}

static PyMethodDef methods[] = {
    {
        "moving_average", 
        moving_average, 
        METH_VARARGS, 
        "Rolling mean of numpy array with specified window size"
    },
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef moduledef = {
    PyModuleDef_HEAD_INIT,
    "moving_average",
    "C extension for finding the rolling mean of a numpy array",
    -1,
    methods
};

PyMODINIT_FUNC PyInit_moving_average(void) {
    PyObject *module = PyModule_Create(&moduledef);
    import_array();
    return module;
}
  • METH_VARARGS specifies that the method only takes positional arguments. PyArg_ParseTuple allows you to parse these positional arguments.

  • By using PyErr_SetString and returning NULL from the method, you can signal that an exception has occurred to the Python interpreter from the C extension.

  • PyArray_AsCArray allows your method to be polymorphic when it comes to input array dtype, alignment, whether the array is C-contiguous (See "Can a numpy 1d array not be contiguous?") etc. without needing to create a copy of the array. If you instead used PyArray_DATA, you'd need to deal with this yourself.

  • PyArray_SimpleNew allows you to create a new numpy array. This is similar to using np.empty. The array will not be initialized, and might contain non-deterministic junk which could surprise you if you forget to overwrite it.

Building the C extension

# setup.py
from setuptools import setup, Extension
import numpy

setup(
  ext_modules=[
    Extension(
      'moving_average',
      ['moving_average.c'],
      include_dirs=[numpy.get_include()]
    )
  ]
)

# python setup.py build_ext --build-lib=.

Benchmarks

import numpy as np

# Our compiled C extension:
from moving_average import moving_average as moving_average_c

# Answer by Jaime using np.cumsum
def moving_average_cumsum(a, n) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

# Answer by yatu using np.convolve
def moving_average_convolve(a, n):
    return np.convolve(a, np.ones(n), 'valid') / n

a = np.random.rand(1_000_000)
print('window_size = 3')
%timeit moving_average_c(a, 3)
%timeit moving_average_cumsum(a, 3)
%timeit moving_average_convolve(a, 3)

print('\nwindow_size = 100')
%timeit moving_average_c(a, 100)
%timeit moving_average_cumsum(a, 100)
%timeit moving_average_convolve(a, 100)
window_size = 3
958 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
4.52 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
809 µs ± 463 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

window_size = 100
977 µs ± 937 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
6.16 ms ± 19.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.2 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Tobias Bergkvist
  • This is impressive but unless the runtime of this is absolutely critical in your codebase, it would be better to use one of the simpler solutions offered that stays within Python. – WolfLink Nov 08 '22 at 02:40
1

talib contains a simple moving average tool, as well as other similar averaging tools (e.g. exponential moving average). Below, the method is compared to some of the other solutions.


%timeit pd.Series(np.arange(100000)).rolling(3).mean()
2.53 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit talib.SMA(real = np.arange(100000.), timeperiod = 3)
348 µs ± 3.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit moving_average(np.arange(100000))
638 µs ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

One caveat is that real must have elements of dtype float. Otherwise the following error is raised:

Exception: real is not double
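A minimal fix (my sketch, assuming the same call as above): cast the input to float before calling SMA:

import numpy as np
import talib

x = np.arange(100000)                                # integer dtype would raise "real is not double"
sma = talib.SMA(real=x.astype(float), timeperiod=3)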

Josmoor98
0

moving average

iterator method

  • at each position i, reverse the array up to i and take the mean of the first n elements, i.e. the mean of the (up to) n most recent values.

  • use list comprehension to generate mini arrays on the fly.

x = np.random.randint(10, size=20)

def moving_average(arr, n):
    return [ (arr[:i+1][::-1][:n]).mean() for i, ele in enumerate(arr) ]
d = 5

moving_average(x, d)

tensor convolution

moving_average = np.convolve(x, np.ones(d)/d, mode='valid')

Note that the two methods differ at the edges: the iterator method returns len(x) values, averaging partial windows at the start, whereas mode='valid' returns only the len(x) - d + 1 fully overlapping windows.
inivri
  • thank you for the two solutions. I run both with the example data `x` but it is giving different results. Could you explain why? – Ruthger Righart Nov 24 '20 at 08:10
0

I use either the accepted answer's solution, slightly modified to have the same length for output as input, or pandas' version as mentioned in a comment on another answer. I summarize both here with a reproducible example for future reference:

import numpy as np
import pandas as pd

def moving_average(a, n):
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret / n

def moving_average_centered(a, n):
    return pd.Series(a).rolling(window=n, center=True).mean().to_numpy()

A = [0, 0, 1, 2, 4, 5, 4]
print(moving_average(A, 3))    
# [0.         0.         0.33333333 1.         2.33333333 3.66666667 4.33333333]
print(moving_average_centered(A, 3))
# [nan        0.33333333 1.         2.33333333 3.66666667 4.33333333 nan       ]
Basj
-1
for i in range(lookback, len(Data)):
    # average of the previous `lookback` values in column 0
    Data[i, 1] = Data[i-lookback:i, 0].sum() / lookback

Try this piece of code. I think it's simpler and does the job; the loop starts at lookback so the slice always holds a full window. lookback is the window of the moving average.

In Data[i-lookback:i, 0].sum() the 0 refers to the first column of the dataset, but you can use any column you like in case you have more than one column.

-1

Comparing the solution below with the one that uses np.cumsum, this one takes almost half the time, because it does not need to go through the entire array to do the cumsum and then do all the subtractions. Moreover, the cumsum can be "dangerous" if the array is huge and the numbers are huge (possible overflow). Of course, the danger also exists here, but at least only the essential numbers are summed together.

def moving_average(array_numbers, n):
    if n > len(array_numbers):
      return []
    temp_sum = sum(array_numbers[:n])
    averages = [temp_sum / float(n)]
    for first_index, item in enumerate(array_numbers[n:]):
        temp_sum += item - array_numbers[first_index]
        averages.append(temp_sum / float(n))
    return averages
-2

I actually wanted a slightly different behavior than the accepted answer. I was building a moving average feature extractor for an sklearn pipeline, so I required that the output of the moving average have the same dimension as the input. What I want is for the moving average to assume the series stays constant, i.e. a moving average of [1,2,3,4,5] with window 2 would give [1.5,2.5,3.5,4.5,5.0].

For column vectors (my use case) we get

def moving_average_col(X, n):
  z2 = np.cumsum(np.pad(X, ((n,0),(0,0)), 'constant', constant_values=0), axis=0)
  z1 = np.cumsum(np.pad(X, ((0,n),(0,0)), 'constant', constant_values=X[-1]), axis=0)
  return (z1-z2)[(n-1):-1]/n

And for arrays

def moving_average_array(X, n):
  z2 = np.cumsum(np.pad(X, (n,0), 'constant', constant_values=0))
  z1 = np.cumsum(np.pad(X, (0,n), 'constant', constant_values=X[-1]))
  return (z1-z2)[(n-1):-1]/n

Of course, one doesn't have to assume constant values for the padding, but doing so should be adequate in most cases.
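For instance (my check, not part of the original answer), the array version reproduces the example from the opening paragraph:

import numpy as np

X = np.array([1, 2, 3, 4, 5])
print(moving_average_array(X, 2))
# [1.5 2.5 3.5 4.5 5. ]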

cbartondock
-2

Here is a fast implementation using numba (mind the types). Note that it does contain nans where shifted.

import numpy as np
import numba as nb

@nb.jit(nb.float64[:](nb.float64[:],nb.int64),
        fastmath=True,nopython=True)
def moving_average( array, window ):    
    ret = np.cumsum(array)
    ret[window:] = ret[window:] - ret[:-window]
    ma = ret[window - 1:] / window
    n = np.empty(window-1); n.fill(np.nan)
    return np.concatenate((n.ravel(), ma.ravel())) 
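A usage sketch (my addition): the explicit signature means the input must already be float64, and the first window - 1 outputs are nan:

import numpy as np

moving_average(np.arange(10, dtype=np.float64), 4)
# array([nan, nan, nan, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])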
-2

If you already have an array of known size:

import numpy as np                                         
M=np.arange(12)
                                                               
avg=[]                                                         
i=0
while i<len(M)-2: #for n point average len(M) - (n-1)
        avg.append((M[i]+M[i+1]+M[i+2])/3) #n is denominator                       
        i+=1     
                                                                                                    
print(avg)
nightcrawler