Fast way to take average of every N rows in a .npy array

Question

I have a very large masked NumPy array (originalArray) with many rows and two columns. I want to take the average of every two rows in originalArray and build a newArray in which each row is the average of two rows in originalArray (so newArray has half as many rows as originalArray). This should be a simple thing to do, but the script below is EXTREMELY slow. Any advice from the community would be greatly appreciated.

newList = []
for i in range(0, originalArray.shape[0], 2):
    r = originalArray[i:i+2,:].mean(axis=0)
    newList.append(r)
newArray = np.asarray(newList)

There must be a more elegant way of doing this. Many thanks!

You want to apply a function to non-overlapping *windows* of a ```numpy``` array. Here are some SO links: [Using strides for an efficient moving average filter](http://stackoverflow.com/q/7542135/2823755), [Python - vectorizing a sliding window](http://stackoverflow.com/q/18424900/2823755) - a couple of the answers look relevant, [Divide an image into 5x5 blocks in python and compute histogram for each block](http://stackoverflow.com/a/22749434/2823755), [Elements arrangement in a numpy array](http://stackoverflow.com/q/23645484/2823755) - the ```extract_patches``` answer would work. — wwii, May 21 '15 at 16:55

swenzel · Accepted Answer · 2016-01-29T13:31:44.460

The mean of two values a and b is 0.5*(a+b), so you can do it like this:

newArray = 0.5*(originalArray[0::2] + originalArray[1::2])

It sums each pair of consecutive rows and then multiplies every element by 0.5. It assumes an even number of rows, so that both slices have the same length.
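A minimal sketch of the slicing approach on a small plain array, using the sample values from the comments in this thread. The helper name `pairwise_mean` is hypothetical, and dropping a trailing unpaired row is an assumption made here so that the two slices always have equal length.

```python
import numpy as np

def pairwise_mean(arr):
    """Average consecutive row pairs; a trailing unpaired row is dropped (assumption)."""
    usable = arr.shape[0] - arr.shape[0] % 2  # largest even row count
    # Even-indexed rows plus odd-indexed rows, halved.
    return 0.5 * (arr[0:usable:2] + arr[1:usable:2])

a = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
print(pairwise_mean(a))
```

No Python-level loop is involved, which is where the speed-up over the original script should come from.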

Since the title asks for the average over N rows, here is a more general solution:

def groupedAvg(myArray, N=2):
    result = np.cumsum(myArray, 0)[N-1::N]/float(N)
    result[1:] = result[1:] - result[:-1]
    return result

The general form of the average over n elements is sum([x1,x2,...,xn])/n. The sum of the n elements m through m+n-1 of a vector v equals cumsum(v)[m+n-1] minus cumsum(v)[m-1]. If m is 0 there is nothing to subtract (that case is result[0]).
That is what we take advantage of here. Also, since everything is linear, it does not matter where we divide by N, so we do it right at the beginning; that is just a matter of taste.
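To make the cumulative-sum identity concrete, here is a small worked sketch on the 1D vector mentioned in the comments. The variable names (`cum`, `ends`, `group_sums`) are chosen for illustration only and are not part of the answer's code.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5, 6, 7])
N = 3
cum = np.cumsum(v)  # running totals of v

# Sum of the group starting at index m (elements m .. m+N-1)
# as a difference of two cumulative sums.
m = 3
group_sum = cum[m + N - 1] - cum[m - 1]
assert group_sum == v[m:m + N].sum()

# The same idea for all complete groups at once.
ends = cum[N - 1::N]                                     # totals at each group end
group_sums = ends - np.concatenate(([0], ends[:-1]))     # subtract the previous total
group_means = group_sums / N
print(group_means)
```

One caveat worth keeping in mind: for long floating-point arrays, differences of large running totals may lose some precision compared with averaging each group directly. Whether that matters depends on the data.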

If the last group has fewer than N elements, it will be ignored completely. If you don't want to ignore it, you have to treat the last group specially:

def avg(myArray, N=2):
    cum = np.cumsum(myArray,0)
    result = cum[N-1::N]/float(N)
    result[1:] = result[1:] - result[:-1]

    remainder = myArray.shape[0] % N
    if remainder != 0:
        if remainder < myArray.shape[0]:
            lastAvg = (cum[-1]-cum[-1-remainder])/float(remainder)
        else:
            lastAvg = cum[-1]/float(remainder)
        result = np.vstack([result, lastAvg])

    return result
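As an alternative sketch for the partial-last-group case, `np.add.reduceat` can sum each group directly, so no special branch is needed. This is not the approach used in this answer; the function name `grouped_avg_reduceat` is hypothetical, and the sketch assumes a plain (unmasked) ndarray grouped along axis 0. Its behaviour with masked arrays has not been checked here.

```python
import numpy as np

def grouped_avg_reduceat(arr, N=2):
    """Mean of every N rows; a shorter final group is averaged over its own length."""
    starts = np.arange(0, arr.shape[0], N)          # first row index of each group
    sums = np.add.reduceat(arr, starts, axis=0)     # per-group sums along axis 0
    counts = np.minimum(N, arr.shape[0] - starts)   # number of rows in each group
    # Reshape counts so they broadcast across any trailing axes.
    counts = counts.reshape((-1,) + (1,) * (arr.ndim - 1))
    return sums / counts

print(grouped_avg_reduceat(np.arange(10.0).reshape(5, 2), N=2))
print(grouped_avg_reduceat(np.array([1, 2, 3, 4, 5, 6, 7]), N=3))
```

Because each group is summed on its own, this also avoids subtracting large running totals from each other.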

I think you should remove that `-1` in the middle. It's skipping the last row. — Thijs van Dien, May 21 '15 at 16:39
Very clever algorithm! It gives wrong answer though when trying to average over a simple 1D array, e.g. [1,2,3,4,5,6,7]. The problem is the result[1:] -= result[:-1] line. If you change it to: result[1:] = result[1:] - result[:-1] it works perfectly. It seems these two are equivalent but as I [found out](http://stackoverflow.com/questions/35036126/difference-between-a-b-and-a-a-b-in-python) there is a crucial difference between them — iasonas, Jan 27 '16 at 15:02
@iasonas you're right, that's a pretty serious issue, also for 2D arrays... I'll change that, thank you! :) — swenzel, Jan 29 '16 at 13:27

Jona · Answer 2 · 2016-07-13T09:40:09.650

Your problem (average of every two rows with two columns):

>>> a = np.reshape(np.arange(12),(6,2))
>>> a
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])
>>> a.transpose().reshape(-1,2).mean(1).reshape(2,-1).transpose()
array([[  1.,   2.],
       [  5.,   6.],
       [  9.,  10.]])

Other dimensions (average of every four rows with three columns):

>>> a = np.reshape(np.arange(24),(8,3))
>>> a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]])
>>> a.transpose().reshape(-1,4).mean(1).reshape(3,-1).transpose()
array([[  4.5,   5.5,   6.5],
       [ 16.5,  17.5,  18.5]])

General formula for taking the average of r rows for a 2D array a with c columns:

a.transpose().reshape(-1,r).mean(1).reshape(c,-1).transpose()
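A possibly simpler sketch of the same idea reshapes to three dimensions and averages over the middle axis, which avoids the two transposes. The helper name `block_mean` is hypothetical, and dropping leftover rows when the row count is not a multiple of r is an assumption made here. The second part illustrates that, for a masked array like the one in the question, `mean` averages only the unmasked entries of each group.

```python
import numpy as np

def block_mean(a, r):
    """Average every r rows of a 2D array; leftover rows at the end are dropped (assumption)."""
    usable = a.shape[0] - a.shape[0] % r
    # Shape (groups, r, columns): axis 1 indexes the rows inside each group.
    return a[:usable].reshape(-1, r, a.shape[1]).mean(axis=1)

a = np.arange(24).reshape(8, 3)
print(block_mean(a, 4))  # same values as the transpose-based formula

# Masked input: the -1.0 entry is masked and excluded from its group's mean.
m = np.ma.masked_equal(
    np.array([[1.0, 1.0], [3.0, -1.0], [5.0, 5.0], [7.0, 7.0]]), -1.0
)
print(block_mean(m, 2))
```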

As a note to others looking around, this answer only works when a.shape[0] % r == 0 — Will.Evo, Sep 13 '19 at 22:29

farhawa · Answer 3 · 2015-05-21T16:42:02.497

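import numpy as np

def av(array):
    # Stack each pair of consecutive rows along a new axis and average over it.
    # Integer division keeps the shape integral; the divisor is the group size (2),
    # not the number of columns.
    return array.reshape(array.shape[0] // 2, 2, array.shape[1]).mean(axis=1)

a = np.array([[1,1],[2,2],[3,3],[4,4]])

print(av(a))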

>> [[ 1.5  1.5] [ 3.5  3.5]]

edited May 21 '15 at 16:42

answered May 21 '15 at 16:29

Thanks, but what I had in mind was the following: `a = np.array([[1,1],[2,2],[3,3],[4,4]])` with the result as `r = ([[1.5,1.5],[3.5,3.5]])` – Emily May 21 '15 at 16:32
