aggregate values of one column by classes in second column using numpy

Question

I have a numpy array with shape (N, 2) and N > 10000. In the first column I have, e.g., 6 class values (0.0, 0.2, 0.4, 0.6, 0.8, 1.0); in the second column I have float values. Now I want to calculate the average of the second column for each distinct class in the first column, resulting in 6 averages, one for each class.

Is there a numpy way to do this, to avoid manual loops especially if N is very large?

For problems like that, you may want to use `pandas`, see: http://pandas.pydata.org/pandas-docs/dev/groupby.html — cel, Feb 22 '15 at 18:39
This is a "groupby/aggregation" operation. The question is *this close* to being a duplicate of http://stackoverflow.com/questions/28597383/getting-median-of-particular-rows-of-array-based-on-index. The `pandas` code that I gave there should also work here (with the obvious change of `median` to `mean`). You could also use `scipy.ndimage.labeled_comprehension` as suggested there, but you would have to convert the first column to integers (e.g. `idx = (5*data[:, 0]).astype(int)`). — Warren Weckesser, Feb 22 '15 at 20:53
But if you don't want any additional dependencies, @Jaime's answer is a good one. — Warren Weckesser, Feb 22 '15 at 21:09
The labeled_comprehension approach seems to be the best for my application since I can replace the mean by other aggregates and I don't need additional packages. Thank you very much. — Michael Hecht, Feb 23 '15 at 08:39
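The `pandas` route mentioned in the comments above might look roughly like the following. This is a minimal sketch, not the code from the linked answer: the sample array and the column names `cls` and `val` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column 0 holds class values, column 1 holds floats.
arr = np.array([[0.0, 1.0],
                [0.2, 2.0],
                [0.0, 3.0],
                [0.2, 6.0],
                [0.4, 5.0]])

# Column names are illustrative assumptions, not part of the question.
df = pd.DataFrame(arr, columns=["cls", "val"])

# Group rows by class and average the value column (keys come back sorted).
means = df.groupby("cls")["val"].mean()

# Other aggregates plug into the same grouping.
stats = df.groupby("cls")["val"].agg(["mean", "median", "std"])
```

Here `means` is a Series indexed by class value, and `stats` holds one row per class with one column per aggregate.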

score 3 · Answer 1 · answered Feb 22 '15 at 19:49

In pure numpy you would do something like:

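import numpy as np
# unq: sorted class values; idx: class index of each row; cnt: rows per class
unq, idx, cnt = np.unique(arr[:, 0], return_inverse=True,
                          return_counts=True)
# per-class sum of the second column, divided by the per-class row count
avg = np.bincount(idx, weights=arr[:, 1]) / cnt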

Jaime

Ok, this seems to work (from numpy 1.9.x), but only for mean. Other aggregations (median, stdev, etc.) cannot be calculated so directly. – Michael Hecht Feb 23 '15 at 08:29
Well, if you ask for a way to "calculate the average with numpy," it should come as no surprise that you get an answer that calculates the average with numpy... And just so you know, `labeled_comprehension` is just a big fat for loop in hiding, see [here](https://github.com/scipy/scipy/blob/v0.15.1/scipy/ndimage/measurements.py#L406). If you want performant generic aggregation functionality, either use pandas built-ins, or hack them yourself in numpy. – Jaime Feb 23 '15 at 15:04
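One way to "hack" a generic aggregation in numpy, as the last comment suggests, is to sort by class and split the values into contiguous runs. The following is only a sketch under that assumption: the helper name `group_aggregate` and the sample data are hypothetical, and it still loops in Python once per class (not once per row).

```python
import numpy as np

def group_aggregate(arr, func):
    """Apply func to the column-1 values of each distinct column-0 class."""
    # Stable sort by class so that each class occupies one contiguous run.
    order = np.argsort(arr[:, 0], kind="stable")
    keys = arr[order, 0]
    vals = arr[order, 1]
    # return_index gives the first position of each class in the sorted keys.
    unq, start = np.unique(keys, return_index=True)
    # Split the values at the run boundaries: one sub-array per class.
    groups = np.split(vals, start[1:])
    return unq, np.array([func(g) for g in groups])

# Hypothetical sample data: column 0 holds class values, column 1 holds floats.
arr = np.array([[0.0, 1.0],
                [0.2, 2.0],
                [0.0, 3.0],
                [0.2, 6.0],
                [0.4, 5.0]])

unq, med = group_aggregate(arr, np.median)
_, std = group_aggregate(arr, np.std)
```

With only a handful of classes, the per-class loop is cheap even for large N, since the sorting and splitting are done by numpy.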

score 0 · Answer 2 · edited May 23 '17 at 12:12

I copied the answer from Warren to here, since it solves my problem best and I want to check it as solved:

This is a "groupby/aggregation" operation. The question is this close to being a duplicate of getting median of particular rows of array based on index. ... You could also use scipy.ndimage.labeled_comprehension as suggested there, but you would have to convert the first column to integers (e.g. idx = (5*data[:, 0]).astype(int)).

I did exactly this.
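The answer does not show the actual code, so the following is only a sketch of what the `labeled_comprehension` approach could look like. The sample data is made up, and `np.rint` is an added assumption that rounds before the integer conversion, to guard against a product such as `5 * 0.6` landing just below a whole number.

```python
import numpy as np
from scipy import ndimage

# Hypothetical sample data: column 0 holds class values, column 1 holds floats.
arr = np.array([[0.0, 1.0],
                [0.2, 2.0],
                [0.0, 3.0],
                [0.2, 6.0],
                [0.4, 5.0]])

# Map class values 0.0, 0.2, ..., 1.0 to integer labels 0..5.
# Rounding first is an assumption added here, not part of the quoted comment.
labels = np.rint(5 * arr[:, 0]).astype(int)

# Labels that actually occur in the data.
index = np.unique(labels)

# Apply np.median to the values of each label; np.nan is the default
# used for any requested label that has no rows.
med = ndimage.labeled_comprehension(arr[:, 1], labels, index,
                                    np.median, float, np.nan)
```

Swapping `np.median` for `np.mean`, `np.std`, or any other function of a 1-D array changes the aggregate without touching the rest.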
