
I have an array of values, say v (e.g. v=[1,2,3,4,5,6,7,8,9,10]), and an array of indexes, say g (e.g. g=[0,0,0,0,1,1,1,1,2,2]).

I know, for instance, how to take the first element of each group, in a very numpythonic way, doing:

import numpy as np
v=np.array([1,2,3,4,74,73,72,71,9,10])
g=np.array([0,0,0,0,1,1,1,1,2,2])
mask=np.concatenate(([True],np.diff(g)!=0))
v[mask]

returns:

array([1, 74, 9])

Is there any numpythonic way (avoiding explicit loops) to get the maximum of each subset?


Tests:

Since I received two good answers, one with the python map and one with a numpy routine, and I was searching the most performing, here some timing tests:

import numpy as np
import time
N=10000000
v=np.arange(N)
Nelems_per_group=10
Ngroups=N//Nelems_per_group
s=np.arange(Ngroups)
g=np.repeat(s,Nelems_per_group)

start1=time.time()
r=np.maximum.reduceat(v, np.unique(g, return_index=True)[1])
end1=time.time()
print('END first method, T=',(end1-start1),'s')

start3=time.time()
np.array(list(map(np.max,np.split(v,np.where(np.diff(g)!=0)[0]+1))))
end3=time.time()
print('END second method,  (map returns an iterable) T=',(end3-start3),'s')

As a result I get:

END first method, T= 1.6057236194610596 s
END second method,  (map returns an iterable) T= 8.346540689468384 s

Interestingly, most of the slowdown of the map method is due to the list() call. If I do not convert the map result back to a list, the map call itself is almost instantaneous (but I have to convert it, because Python 3.x map returns an iterator: https://docs.python.org/3/library/functions.html#map).
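The laziness can be seen directly: building the map object returns immediately, and the per-group maxima are only computed when the iterator is consumed. A small sketch (with smaller, made-up sizes than in the benchmark above):

```python
import numpy as np

# Smaller, illustrative sizes (not the benchmark's N=10000000)
v = np.arange(10_000)
g = np.repeat(np.arange(1_000), 10)

pieces = np.split(v, np.where(np.diff(g) != 0)[0] + 1)
m = map(np.max, pieces)        # returns immediately; no maxima computed yet
result = np.array(list(m))     # the actual work happens here, in list()

print(result[:3].tolist())     # -> [9, 19, 29]
```

So the timing gap above measures the cost of actually evaluating the maxima, not of constructing the map object.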

Antonio Ragagnin

3 Answers


You can use np.maximum.reduceat:

>>> _, idx = np.unique(g, return_index=True)
>>> np.maximum.reduceat(v, idx)
array([ 4, 74, 10])

More about the workings of the ufunc reduceat method can be found here.
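In short, for an index array idx, element k of the reduceat result is the reduction of v[idx[k]:idx[k+1]], with the last slice running to the end of the array. A minimal sketch:

```python
import numpy as np

# For idx = [0, 4, 8], reduceat computes max over v[0:4], v[4:8], v[8:]
v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
idx = np.array([0, 4, 8])

out = np.maximum.reduceat(v, idx)
print(out.tolist())  # -> [4, 74, 10]

# The same result via explicit slicing, for comparison:
manual = [v[0:4].max(), v[4:8].max(), v[8:].max()]
assert out.tolist() == manual
```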


Remark about performance

np.maximum.reduceat is very fast. Generating the indices idx is what takes most of the time here.

While _, idx = np.unique(g, return_index=True) is an elegant way to get the indices, it is not particularly quick.

The reason is that np.unique needs to sort the array first, which is O(n log n) in complexity. For large arrays, this is much more expensive than using several O(n) operations to generate idx.

Therefore, for large arrays it is much faster to use the following instead:

idx = np.concatenate([[0], 1+np.diff(g).nonzero()[0]])
np.maximum.reduceat(v, idx)
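A quick sanity check (under the assumption that g is already grouped contiguously, as in the question) that the diff-based indices agree with those from np.unique:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.repeat(np.arange(1000), 7)   # contiguously grouped indices
v = rng.random(g.size)

# O(n) construction of the group-start indices:
idx_fast = np.concatenate([[0], 1 + np.diff(g).nonzero()[0]])
# O(n log n) construction via sorting inside np.unique:
_, idx_unique = np.unique(g, return_index=True)

assert np.array_equal(idx_fast, idx_unique)
assert np.array_equal(np.maximum.reduceat(v, idx_fast),
                      np.maximum.reduceat(v, idx_unique))
```

Note that this equivalence only holds when equal values of g are adjacent; np.unique would still work on an unsorted g, while the diff-based version would not.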
Alex Riley

You can create your mask as follows and use the map function:

>>> mask=np.diff(g)!=0
>>> list(map(np.max,np.split(v,np.where(mask)[0]+1)))
[4, 74, 10]

In Python 3, map returns an iterator rather than a list. If you want a list directly, you can use a list comprehension to achieve the same result; note that the iteration of a list comprehension is performed at C language speed inside the interpreter, like built-in functions.

[np.max(arr) for arr in np.split(v,np.where(mask)[0]+1)]

But I think the numpythonic solution is still the better one to use.

Mazdak
  • nice one, I will probably use it as long as I do not have any `numpy`thonic solution. In fact, this will have a (non C based) loop over the subsets, which in my real case are quite large. – Antonio Ragagnin Dec 10 '15 at 18:05
  • 1
    @AntonioRagagnin `map()` is a built-in function in Python and its iterations are performed at C language speed inside the interpreter. – Mazdak Dec 10 '15 at 18:08
  • interesting, please see my update on the answer where I compared the codes. – Antonio Ragagnin Dec 11 '15 at 10:02
  • Thanks for the answer, I really didn't know those iterations were made at a C level; in fact they are way faster than I thought, and the difference with numpy only shows up with objects of sizes >10000000 elements – Antonio Ragagnin Dec 11 '15 at 10:29
  • @AntonioRagagnin You're welcome. Yes, numpy shows its power on huge data sets ;-) Read this for more info http://stackoverflow.com/questions/31598677/why-list-comprehension-is-much-faster-than-numpy-for-multiplying-arrays – Mazdak Dec 11 '15 at 10:31

Here's one convoluted vectorized approach using masking and broadcasting that puts each group into a row of a regular 2D array and then finds the maximum along each row -

# Mask of valid numbers from each group to be put in a regular 2D array
counts = np.bincount(g)
mask = np.arange(counts.max()) < counts[:,None]

# Put each group into a row of a 2D array and find max along each row
grouped_2Darray = np.empty(mask.shape)
grouped_2Darray.fill(np.nan)
grouped_2Darray[mask] = v
out = np.nanmax(grouped_2Darray,1)

Sample run -

In [52]: g
Out[52]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

In [53]: v
Out[53]: array([ 1,  2,  3,  4, 74, 73, 72, 71,  9, 10])

In [54]: grouped_2Darray # Notice how elements from v are stacked
Out[54]: 
array([[  1.,   2.,   3.,   4.],
       [ 74.,  73.,  72.,  71.],
       [  9.,  10.,  nan,  nan]])

In [55]: np.nanmax(grouped_2Darray,1)
Out[55]: array([  4.,  74.,  10.])
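One possible tweak (my own variation, not part of the answer above): the np.nan padding forces a float result even for integer input. Filling the padding slots with the array's minimum instead keeps the original dtype, since the padding can never win the maximum:

```python
import numpy as np

v = np.array([1, 2, 3, 4, 74, 73, 72, 71, 9, 10])
g = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

counts = np.bincount(g)
mask = np.arange(counts.max()) < counts[:, None]

# Pad with v.min() instead of NaN so the integer dtype is preserved
grouped = np.full(mask.shape, v.min(), dtype=v.dtype)
grouped[mask] = v
out = grouped.max(axis=1)

print(out.tolist())  # -> [4, 74, 10]
```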
Divakar