I have three arrays:

import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])

The arrays "index" and "value" have the same size, and I want to group the items in "value" by their key in "index" and take the average of each group. For example, the first two items in "value" ([1, 3, ...]) share the key 1 in "index", so the first entry of the final array is the mean of the 1st and 2nd items of "value": (1 + 3) / 2 = 2.

The final array is:

[2, nan, 4, nan, nan, 5]

The first value is the average of the 1st and 2nd items of "value".
The second value is nan because the key 2 does not occur in "index".
The third value is the average of the 3rd and 4th items of "value", and so on.

Thanks for your help!!!

Regards, Roy

Roy
  • "[...]because there is not any key in index" - can you explain how the indices in the index array relate to the average values any better? – Jim Brissom Jan 13 '11 at 02:00
  • Oh sorry, maybe my explanation was not clear. The arrays "index" and "value" have the same size, and I want to group the items in "value" by taking the average. For example, the first two items in "value" ([1, 3, ...]) share the key 1 in "index", so the first entry of the final array is the mean of the 1st and 2nd items of "value": (1 + 3) / 2 = 2. – Roy Jan 13 '11 at 02:08
  • Just edit your original posting. Comments are not really made for that. – Jim Brissom Jan 13 '11 at 02:13

4 Answers

>>> [value[index==i].mean() for i in data]
[2.0, nan, 4.0, nan, nan, 5.0]
Steve Tjoa

Maybe you would like to use numpy.bincount()?

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
np.bincount(index, value) / np.bincount(index)
# array([ NaN,   2.,  NaN,   4.,  NaN,  NaN,   5.])
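Note that the result above has one slot per integer from 0 to index.max(), so it starts with an extra entry for key 0. A minimal sketch of lining it up with "data", assuming all keys are non-negative integers small enough to serve as array positions (the names size, sums, counts, means and result are only for illustration):

```python
import numpy as np

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])

# minlength reserves a slot for every key in data, even keys above index.max()
size = max(index.max(), data.max()) + 1
sums = np.bincount(index, weights=value, minlength=size)
counts = np.bincount(index, minlength=size)

# 0/0 yields nan for keys that never occur; silence the warning that comes with it
with np.errstate(divide="ignore", invalid="ignore"):
    means = sums / counts

# pick out the slots belonging to the keys listed in data
result = means[data]
print(result)
```

Indexing with data drops the key-0 slot and keeps the output the same length as data, which should match the array asked for in the question.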
Sven Marnach

Is this the general idea you are looking for?

import numpy as np
value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])

answer = np.array(data, dtype=float)
for i, e in enumerate(data):
    idx = np.where(index == e)[0]  # positions in index whose key equals e
    val = value[idx]
    answer[i] = np.mean(val)  # nan (with a RuntimeWarning) when nothing matches

print(answer)  # [  2.  nan   4.  nan  nan   5.]

If your data array is very large, there may be better solutions.
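One possible loop-free variant, as a rough sketch rather than a tested solution for the full-size data: group with np.unique, sum with np.bincount, then look up each key of data with np.searchsorted. The helper name group_means is made up for this example, and it assumes index is non-empty.

```python
import numpy as np

def group_means(index, value, data):
    """Mean of value per key in data; nan where a key is absent from index."""
    # keys: sorted distinct keys; inverse: position of each index item in keys
    keys, inverse = np.unique(index, return_inverse=True)
    sums = np.bincount(inverse, weights=value)
    counts = np.bincount(inverse)
    means = sums / counts  # every key in keys occurs at least once, so no 0/0

    # locate each requested key among the keys that actually occur
    pos = np.searchsorted(keys, data)
    pos = np.clip(pos, 0, len(keys) - 1)
    found = keys[pos] == data

    out = np.full(len(data), np.nan)
    out[found] = means[pos[found]]
    return out

value = np.array([1, 3, 3, 5, 5, 7, 3])
index = np.array([1, 1, 3, 3, 6, 6, 6])
data = np.array([1, 2, 3, 4, 5, 6])
print(group_means(index, value, data))
```

Because the grouping goes through np.unique, the keys do not need to be small or contiguous, and memory use grows with the number of distinct keys rather than with len(value) * len(data).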

Paul
  • Yes, my data is actually very large :P, around 4320000 records. Sorry for the unclear question. – Roy Jan 13 '11 at 02:31
  • How big are value and index then? – Paul Jan 13 '11 at 02:35
  • Is a len(value) by len(data) 2D array too big to fit in memory? – Paul Jan 13 '11 at 02:42
  • "value" and "index" each have 4320000 items; "data" is smaller, with 1124000. There is not enough memory to build that huge array. – Roy Jan 13 '11 at 02:51
  • Then I think I'd stick with the above solution. You could use an array mask instead of `where` to try to optimize, but I think you are stuck iterating in Python. If it is still too slow, you can try Cython. – Paul Jan 13 '11 at 03:12
  • Then it's time to study a new library. Anyway, thanks so much!! – Roy Jan 13 '11 at 03:33

After some searching, I found that numpy's histogram can handle the huge arrays:

value = np.array([1, 3, 3, 5, 5, 7, 3], dtype='float')
index = np.array([1, 1, 3, 3, 6, 6, 6], dtype='float')
data = np.array([1, 2, 3, 4, 5, 6])

sums = np.histogram(index, bins=np.arange(index.min(), index.max()+2), weights=value)[0]
counter = np.histogram(index, bins=np.arange(index.min(), index.max()+2))[0]

sums / counter

array([ 2., NaN, 4., NaN, NaN, 5.])

Roy