Way of easily finding the average of every nth element over a window of size k in a pandas.Series? (not the rolling mean)

Question

The motivation here is to take a time series and get the average activity throughout a sub-period (day, week).

It is possible to reshape an array and take the mean over the y axis to achieve this, similar to this answer (but using axis=2):

Averaging over every n elements of a numpy array

but I'm looking for something that can handle arrays whose length N is not divisible by k (N % k != 0) and that does not solve the issue by reshaping and padding with ones or zeros (e.g. numpy.resize), i.e. it takes the average over the existing data only.

E.g. start with a sequence [2,2,3,2,2,3,2,2,3,6] of length N=10, which is not divisible by k=3. What I want is to take the average over the columns of a reshaped array with mismatched row lengths:

In: [[2,2,3], [2,2,3], [2,2,3], [6]], k =3

Out: [3,2,3]

Instead of:

In: [[2,2,3], [2,2,3], [2,2,3], [6,0,0]], k =3

Out: [3,1.5,2.25]

Thank you.

Could you provide a complete example of what you're trying to do and what you've done? Your current example makes no sense. — Ilja Everilä, May 23 '16 at 08:38

Eric · Answer 1 · 2016-05-23T10:19:35.147

You can use a masked array to pad with special values that are ignored when finding the mean, rather than summing zero-padded columns and correcting the divisor yourself.

import numpy as np

in_arr = np.array([2, 2, 3, 2, 2, 3, 2, 2, 3, 6])  # example sequence from the question
k = 3

# how long the array needs to be to be divisible by k
padded_len = (len(in_arr) + (k - 1)) // k * k

# create a np.ma.MaskedArray with the padded entries masked
padded = np.ma.empty(padded_len)
padded[:len(in_arr)] = in_arr
padded[len(in_arr):] = np.ma.masked

# now we can treat it as an array whose length is divisible by k:
mean = padded.reshape((-1, k)).mean(axis=0)

# if you need to remove the masked-ness
assert not np.ma.is_masked(mean), "in_arr was too short to calculate all means"
mean = mean.data
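
The same steps can be wrapped in a small helper. This is only a sketch: the name column_means_masked is made up for illustration, and it builds the mask explicitly with np.ma.array rather than assigning np.ma.masked. It returns the masked array unchanged, so a column with no data stays masked instead of tripping the assert.

```python
import numpy as np

def column_means_masked(values, k):
    """Mean of every k-th element; hypothetical helper, not from the answer."""
    values = np.asarray(values, dtype=float)
    n = values.size
    padded_len = -(-n // k) * k  # n rounded up to a multiple of k

    data = np.zeros(padded_len)
    data[:n] = values
    mask = np.arange(padded_len) >= n  # True marks the padding entries

    padded = np.ma.array(data, mask=mask)
    return padded.reshape(-1, k).mean(axis=0)

means = column_means_masked([2, 2, 3, 2, 2, 3, 2, 2, 3, 6], k=3)
short = column_means_masked([1, 2], k=3)  # third column has no data
```

With the short input, the third entry of the result is masked, which signals that no value exists for that column.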

Imanol Luengo · Accepted Answer · 2016-05-23T09:14:45.627

You can easily do it by padding, reshaping, and calculating how many elements each column sum should be divided by:

>>> import numpy as np
>>> a = np.array([2,2,3,2,2,3,2,2,3,6])
>>> k = 3

Pad the data:

>>> b = np.pad(a, (0, k - a.size%k), mode='constant').reshape(-1, k)
>>> b
array([[2, 2, 3],
       [2, 2, 3],
       [2, 2, 3],
       [6, 0, 0]])

Then create a mask:

>>> c = a.size // k # 3
>>> d = (np.arange(k) + c * k) < a.size # [True, False, False]

The first part of the expression for d creates the array [9, 10, 11] and compares it to the size of a (10), generating the boolean mask shown in the comment.

And divide it:

>>> b.sum(0) / (c + 1.0 * d)
array([ 3.,  2.,  3.])

The above divides the first column by 4 (c + 1 * True) and the rest by 3. This is vectorized NumPy, so it scales well to large arrays.

Everything can be written more compactly; I just show all the steps to make it clearer.
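
One possible shorter form, sketched here with a different technique from the padding shown above: np.bincount can accumulate per-column sums and counts directly from each element's position modulo k, so no padding or reshape is needed. The variable names are illustrative, and the sketch assumes a has at least k elements (otherwise a column count is zero and the division yields nan with a warning).

```python
import numpy as np

a = np.array([2, 2, 3, 2, 2, 3, 2, 2, 3, 6])
k = 3

col = np.arange(a.size) % k                      # column index of each element
sums = np.bincount(col, weights=a, minlength=k)  # per-column totals
counts = np.bincount(col, minlength=k)           # per-column element counts
means = sums / counts
```

Here counts plays the role of c + 1.0 * d: it holds 4 for the first column and 3 for the others.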

edited May 23 '16 at 09:14

answered May 23 '16 at 08:57

This pads with an entire row of zeros if `len(a)` is divisible by k – Eric May 23 '16 at 09:09
Why are you using `filled`? – Eric May 23 '16 at 09:12
@Eric just to get back a raw array, and not a `masked` array. And yep, it does add a row of zeros, but does not affect the result. Avoiding the row of zeros would require a couple more checks – Imanol Luengo May 23 '16 at 09:12
Right, but if `a = [1, 2], k=3`, a `masked` array is the correct result, because there is no value for `b[2]` – Eric May 23 '16 at 09:13

Moses Koledoye · Answer 3 · 2016-05-23T21:27:10.383

Flatten the list In by unpacking and chaining. Create a new list that arranges the flattened list lst by columns, then use the map function to calculate the average of each column:

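from itertools import chain

In = [[2, 2, 3], [2, 2, 3], [2, 2, 3], [6]]

# chain returns an iterator, so materialize it as a list before slicing
lst = list(chain(*In))
k = 3

In_by_cols = [lst[i::k] for i in range(k)]
# [[2, 2, 2, 6], [2, 2, 2], [3, 3, 3]]

# list() is needed on Python 3, where map returns an iterator
Out = list(map(lambda x: sum(x) / float(len(x)), In_by_cols))
# [3.0, 2.0, 3.0]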

Using float on the length of each sublist will provide a more accurate result on python 2.x as it won't do integer truncation.
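
For the pandas.Series mentioned in the question title, the same column grouping can be sketched with groupby on each element's position modulo k. This assumes the data is held in a Series (called s here purely for illustration); it is not taken from any of the answers.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 2, 3, 2, 2, 3, 2, 2, 3, 6])
k = 3

# group by position within each window of size k, then average each group
means = s.groupby(np.arange(len(s)) % k).mean()
```

Because the grouping key is positional, this should work regardless of the Series index (for example a DatetimeIndex), and each group contains only the values that exist, so no padding is involved.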

edited May 23 '16 at 21:27

answered May 23 '16 at 08:55

_"Using `float` on the length ... will provide a more accurate result"_ - by which you mean "won't do integer truncation on python 2". It works just fine without it on python 3, and arguably a better fix is to use `from __future__ import division` – Eric May 23 '16 at 21:15
yes, that would work, but I assumed a `python` tag refers exclusively to python 2.x. Will add that line: "won't do integer truncation on python 2". Thank you – Moses Koledoye May 23 '16 at 21:23
