numpy array mapping and take average

Question

I have three arrays

import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data  = np.array([1, 2, 3, 4, 5, 6])

The arrays "index" and "value" have the same size, and I want to group the items in "value" by their key in "index" and take the average of each group. For example, the first two items of "value" (1 and 3) share the key 1 in "index", so the corresponding entry of the final array is the mean of the 1st and 2nd items of "value": (1 + 3) / 2 = 2. The final array has one entry for each key listed in "data".

The final array is:

[2, nan, 4, nan, nan, 5]

first value is the average of 1st and 2nd of "value"
second value is nan because there is not any key in "index" (no "2" in array index)
third value is the average of 3rd and 4th of "value" ...

Thanks for your help!!!

Regards, Roy

"[...]because there is not any key in index" - can you explain how the indices in the index array relate to the average values any better? — Jim Brissom, Jan 13 '11 at 02:00
Oh sorry, maybe my explanation was not clear. The arrays "index" and "value" have the same size, and I want to group the items in "value" by taking the average. For example, the first two items [1, 3, ... in "value" have the same key 1 in "index", so for the final array the value is the mean of the 1st and 2nd items in "value": (1 + 3) / 2, which equals 2. — Roy, Jan 13 '11 at 02:08
Just edit your original posting. Comments are not really made for that. — Jim Brissom, Jan 13 '11 at 02:13

score 3 · Answer 1 · answered Jan 13 '11 at 07:40

>>> [value[index==i].mean() for i in data]
[2.0, nan, 4.0, nan, nan, 5.0]

Steve Tjoa

score 3 · Answer 2 · answered Jan 13 '11 at 10:21

Maybe you would like to use numpy.bincount()?

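import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
# Sum of "value" per key divided by the number of items per key.
# Element i of the result is the mean for key i, so slot 0 (key 0) comes first.
np.bincount(index, value) / np.bincount(index)
# array([ NaN,   2.,  NaN,   4.,  NaN,  NaN,   5.])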
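The bincount result above is indexed by key starting at 0, so it has one more element than the array asked for in the question. Here is a minimal sketch of mapping it back onto "data", assuming all keys are non-negative integers. The names size, sums, counts, means and result are only illustrative, and minlength is used so that keys in "data" larger than index.max() still get a slot.

```python
import numpy as np

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])

# Assumption: keys are non-negative integers, as np.bincount requires.
# minlength reserves a slot for every key in data, even ones absent from index.
size = max(index.max(), data.max()) + 1
sums = np.bincount(index, weights=value, minlength=size)
counts = np.bincount(index, minlength=size)

# Keys that never occur give 0/0; silence the warning and keep the nan.
with np.errstate(divide="ignore", invalid="ignore"):
    means = sums / counts

# Pick out the mean for each key listed in data (2, nan, 4, nan, nan, 5 here).
result = means[data]
print(result)
```

This stays fully vectorised, so it should scale to the array sizes mentioned in the comments without a Python-level loop.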

Sven Marnach

score 0 · Answer 3 · answered Jan 13 '11 at 02:23

Is this the general idea you are looking for?

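import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data  = np.array([1, 2, 3, 4, 5, 6])

# One output slot per key in data, stored as float so it can hold nan
answer = np.array(data, dtype=float)
for i, e in enumerate(data):
    idx = np.where(index == e)[0]   # positions in index that carry key e
    val = value[idx]
    answer[i] = np.mean(val)        # mean of an empty selection is nan (with a warning)

print(answer)  # [  2.  nan   4.  nan  nan   5.]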

If your data array is very large, there may be better solutions.

Paul

yes my data is actually very large :P, around 4320000 records. Sorry for unclear ask. – Roy Jan 13 '11 at 02:31
how big is value and index then? – Paul Jan 13 '11 at 02:35
is a len(value) by len(data) 2D array too big to fit in memory? – Paul Jan 13 '11 at 02:42
For "value" and "index" size is 4320000 , for "data" is smaller, 1124000 , the memory is not enough to make that huge array – Roy Jan 13 '11 at 02:51
Then I think I'd stick with the above solution. You could use a boolean array mask instead of `where` to try to optimize, but I think you are stuck iterating in Python. If it is still too slow, you can try Cython. – Paul Jan 13 '11 at 03:12
Then it's time to study a new library. Anyway, thanks so much!! – Roy Jan 13 '11 at 03:33

score 0 · Answer 4 · answered Jan 13 '11 at 09:52

After some searching, I found that numpy's histogram can handle the huge arrays:

value = np.array([1, 3, 3, 5, 5, 7, 3], dtype='float')
index = np.array([1, 1, 3, 3, 6, 6, 6], dtype='float')
data = np.array([1, 2, 3, 4, 5, 6])

sums = np.histogram(index , bins=np.arange(index.min(), index.max()+2), weights=value)[0]
counter = np.histogram(index , bins=np.arange(index.min(), index.max()+2))[0]

sums / counter
# array([ 2., NaN, 4., NaN, NaN, 5.])
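The histogram approach relies on the keys being spaced one apart. If that does not hold (for example, arbitrary float keys), one possible alternative is to group by the distinct keys with np.unique and then look up each entry of "data". This is a sketch under that assumption, not something from the thread; keys, inverse, group_means, pos, found and result are illustrative names.

```python
import numpy as np

value = np.array([1, 3, 3, 5, 5, 7, 3], dtype=float)
index = np.array([1, 1, 3, 3, 6, 6, 6], dtype=float)
data = np.array([1, 2, 3, 4, 5, 6])

# keys: sorted distinct values of index; inverse: group number of each item
keys, inverse = np.unique(index, return_inverse=True)
sums = np.bincount(inverse, weights=value)
counts = np.bincount(inverse)
group_means = sums / counts  # every group has at least one item, so no 0/0

# Find where each entry of data would sit among the sorted keys
pos = np.searchsorted(keys, data)
pos = np.minimum(pos, len(keys) - 1)  # keep positions inside the array
found = keys[pos] == data             # True only where the key really occurs

# Keys in data that never occur in index become nan
result = np.where(found, group_means[pos], np.nan)
print(result)
```

The memory used grows with the number of distinct keys rather than with the range of key values, which may matter when keys are sparse.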
