1

I have a numpy array with shape (N, 2) and N > 10000. In the first column I have, e.g., 6 class values (0.0, 0.2, 0.4, 0.6, 0.8, 1.0); in the second column I have float values. Now I want to calculate the average of the second column for each distinct class in the first column, resulting in 6 averages, one per class.

Is there a numpy way to do this that avoids manual loops, especially when N is very large?

Michael Hecht
  • For problems like that, you may want to use `pandas`, see: http://pandas.pydata.org/pandas-docs/dev/groupby.html – cel Feb 22 '15 at 18:39
  • This is a "groupby/aggregation" operation. The question is *this close* to being a duplicate of http://stackoverflow.com/questions/28597383/getting-median-of-particular-rows-of-array-based-on-index. The `pandas` code that I gave there should also work here (with the obvious change of `median` to `mean`). You could also use `scipy.ndimage.labeled_comprehension` as suggested there, but you would have to convert the first column to integers (e.g. `idx = (5*data[:, 0]).astype(int)`). – Warren Weckesser Feb 22 '15 at 20:53
  • But if you don't want any additional dependencies, @Jaime's answer is a good one. – Warren Weckesser Feb 22 '15 at 21:09
  • The labeled_comprehension approach seems to be the best for my application since I can replace the mean by other aggregates and I don't need additional packages. Thank you very much. – Michael Hecht Feb 23 '15 at 08:39
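
For reference, the `pandas` route mentioned in the comments could look roughly like the sketch below. The sample array and the column names `cls` and `value` are made up for illustration; only `groupby`, `mean` and `median` are the actual pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: column 0 holds the class value, column 1 the floats
arr = np.array([[0.0, 1.0],
                [0.2, 2.0],
                [0.0, 3.0],
                [0.2, 6.0],
                [1.0, 5.0]])

# Column names are arbitrary labels chosen for this sketch
df = pd.DataFrame(arr, columns=["cls", "value"])

# One aggregate per distinct class, ordered by class value
means = df.groupby("cls")["value"].mean()
medians = df.groupby("cls")["value"].median()
```

Swapping `mean` for another aggregate (`median`, `std`, ...) only changes the last call, which is the flexibility asked for later in the thread.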

2 Answers

3

In pure numpy you would do something like:

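import numpy as np
# unq: sorted class values, idx: class index of each row, cnt: rows per class
unq, idx, cnt = np.unique(arr[:, 0], return_inverse=True,
                          return_counts=True)
# Sum the second column per class, then divide by the class size
avg = np.bincount(idx, weights=arr[:, 1]) / cnt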
Jaime
  • Ok, this seems to work (from numpy 1.9.x), but only for mean. Other aggregations (median, stdev, etc.) cannot be calculated so directly. – Michael Hecht Feb 23 '15 at 08:29
  • Well, if you ask for a way to "calculate the average with numpy," it should come as no surprise that you get an answer that calculates the average with numpy... And just so you know, `labeled_comprehension` is just a big fat for loop in hiding, see [here](https://github.com/scipy/scipy/blob/v0.15.1/scipy/ndimage/measurements.py#L406). If you want performant generic aggregation functionality, either use pandas built-ins, or hack them yourself in numpy. – Jaime Feb 23 '15 at 15:04
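
One possible way to "hack it yourself in numpy" for aggregates other than the mean is to sort the rows by class and split the values at the class boundaries. This is only a sketch with made-up sample data, not necessarily what Jaime had in mind, and it still loops in Python once per class (not once per row).

```python
import numpy as np

# Hypothetical sample data: column 0 holds the class value, column 1 the floats
arr = np.array([[0.2, 2.0],
                [0.0, 1.0],
                [0.0, 3.0],
                [0.2, 6.0],
                [0.0, 8.0],
                [1.0, 5.0]])

# Stable sort by class so that each class occupies one contiguous run of rows
order = np.argsort(arr[:, 0], kind="stable")
keys = arr[order, 0]
vals = arr[order, 1]

# starts[i] is the position where the i-th distinct class begins in `keys`
unq, starts = np.unique(keys, return_index=True)

# Split the sorted values at every class boundary except the first (index 0)
groups = np.split(vals, starts[1:])

# Any reducing function can be applied per group; the loop runs once per class
medians = np.array([np.median(g) for g in groups])
stds = np.array([np.std(g) for g in groups])
```

With 6 classes the list comprehensions run 6 times regardless of N, so the cost is dominated by the sort.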
0

I copied Warren's comment here, since it solves my problem best and I want to mark the question as solved:

This is a "groupby/aggregation" operation. The question is this close to being a duplicate of getting median of particular rows of array based on index. ... You could also use scipy.ndimage.labeled_comprehension as suggested there, but you would have to convert the first column to integers (e.g. idx = (5*data[:, 0]).astype(int))

I did exactly this.
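
A minimal sketch of what this could look like, assuming made-up sample data and the class spacing of 0.2 from the question. The `np.rint` call is an addition to the quoted expression: it rounds before converting, so a product that lands slightly below an integer is not truncated to the wrong label.

```python
import numpy as np
from scipy import ndimage

# Hypothetical sample data: column 0 holds the class value, column 1 the floats
arr = np.array([[0.0, 1.0],
                [0.2, 2.0],
                [0.0, 3.0],
                [0.2, 6.0],
                [1.0, 5.0]])

# Map class values 0.0, 0.2, ..., 1.0 to integer labels 0..5
# (np.rint is an assumption added here to guard against float truncation)
idx = np.rint(5 * arr[:, 0]).astype(int)

# Only aggregate over labels that actually occur in the data
present = np.unique(idx)

# Apply np.median to the values of each label; np.nan is the fallback value
medians = ndimage.labeled_comprehension(
    arr[:, 1], idx, present, np.median, float, np.nan)
```

Replacing `np.median` with `np.mean`, `np.std` or any other reducing function gives the other aggregates.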

Michael Hecht