
I have two arrays of the same length: one contains an index and the other contains its corresponding value, i.e. one index can have more than one value:

idx = [0,0,0,1,1,1,2,2,2,3,3,3,4,4,5,5...]
values = [1.2,3.1,3.1,3.1,3.3,1.2,3.3,4.1,5.4...]

I want to return an array which holds each unique index together with the median of the values that share that index.

e.g.

result = 
    [[0, np.median([1.2,3.1,3.1])],
     [1, np.median([3.1,3.3,1.2])],
     [2, etc.]]

My brute force approach is to just go:

result = []
for idxi in np.arange(np.max(idx) + 1):
    mask = (idxi == idx)
    medians = np.median(values[mask])
    result.append([idxi, medians])

This is far too slow for my needs, unfortunately, and quite ugly in any case.

Griff

2 Answers


If you don't mind a dependency on scipy, the function scipy.ndimage.labeled_comprehension can do this. Here's an example.

First set up the sample data:

In [570]: import numpy as np

In [571]: idx = np.array([0,0,0,1,1,1,2,2,2,3,3,3,4,4,5,5])

In [572]: values = np.array([1.2,3.1,3.1,3.1,3.3,1.2,3.3,4.1,5.4,6,6,6.2,6,7,7.2,7.2])

Get the unique "labels" in idx. (If you already know the maximum is, say, N, and you know that all the integers from 0 to N are used, you could use uniq = range(N+1) instead.)

In [573]: uniq = np.unique(idx)  # Or range(idx.max()+1)

In [574]: uniq
Out[574]: array([0, 1, 2, 3, 4, 5])

Use labeled_comprehension to compute the median of each labeled group:

In [575]: from scipy.ndimage import labeled_comprehension

In [576]: medians = labeled_comprehension(values, idx, uniq, np.median, np.float64, None)

In [577]: medians
Out[577]: array([ 3.1,  3.1,  4.1,  6. ,  6.5,  7.2])
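
If you also want the two-column [index, median] layout shown in the question, one way (a small addition on top of the answer, not part of labeled_comprehension itself; combined is just an illustrative name) is to stack uniq and medians as columns:

combined = np.column_stack((uniq, medians))  # illustrative name, not from the original answer
# which should give something like:
# array([[ 0. ,  3.1],
#        [ 1. ,  3.1],
#        [ 2. ,  4.1],
#        [ 3. ,  6. ],
#        [ 4. ,  6.5],
#        [ 5. ,  7.2]])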

Another option, if you don't mind the dependency on pandas, is to use the groupby method of the pandas.DataFrame class.

Set up the DataFrame:

In [609]: import pandas as pd

In [610]: df = pd.DataFrame(dict(labels=idx, values=values))

In [611]: df
Out[611]: 
    labels  values
0        0     1.2
1        0     3.1
2        0     3.1
3        1     3.1
4        1     3.3
5        1     1.2
6        2     3.3
7        2     4.1
8        2     5.4
9        3     6.0
10       3     6.0
11       3     6.2
12       4     6.0
13       4     7.0
14       5     7.2
15       5     7.2

Use groupby to group the data by the labels column, and then compute the median of each group:

In [612]: result = df.groupby('labels').median()

In [613]: result
Out[613]: 
        values
labels        
0          3.1
1          3.1
2          4.1
3          6.0
4          6.5
5          7.2
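
If you then need a plain numpy array in the [index, median] form from the question, one way (again a small addition rather than part of the original answer; result_array is just an illustrative name) is to move the labels index back into a column and take the underlying array:

result_array = result.reset_index().values  # illustrative name, not from the original answer
# which should give something like:
# array([[ 0. ,  3.1],
#        [ 1. ,  3.1],
#        [ 2. ,  4.1],
#        [ 3. ,  6. ],
#        [ 4. ,  6.5],
#        [ 5. ,  7.2]])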

Disclaimer: I haven't tried either of those suggestions on large arrays, so I don't know how their performance will compare with your brute force solution or with @Ashwini's answer.

Warren Weckesser
  • Minor: `median` is a method of groupby objects, so `df.groupby("labels").median()` will work too. – DSM Feb 19 '15 at 04:35

For the idx array you can get the unique items using numpy.unique. To group the corresponding values from the other array, use numpy.diff with numpy.where to get the indices where the index changes, split the values array at those indices using numpy.array_split, and then apply np.median to each chunk:

>>> idx = np.array([0,0,0,1,1,1,2,2,2,3,3,3,4,4,5,5])
>>> values = np.array([1.2,3.1,3.1,3.1,3.3,1.2,3.3,4.1,5.4] + [1]*7)
>>> indices = np.where(np.diff(idx) != 0)[0] + 1
>>> medians = [np.median(a) for a in np.array_split(values, indices)]
>>> medians
[3.1, 3.1, 4.1, 1.0, 1.0, 1.0]
>>> uniq = np.unique(idx)
>>> uniq
array([0, 1, 2, 3, 4, 5])
>>> np.dstack((uniq, medians))[0]
array([[ 0. ,  3.1],
       [ 1. ,  3.1],
       [ 2. ,  4.1],
       [ 3. ,  1. ],
       [ 4. ,  1. ],
       [ 5. ,  1. ]])
Ashwini Chaudhary
  • This would only work if the idx values are in sequential order and grouped together right (i.e. [0,0,0,1,1,1...] not [0,1,2,0,1...])? Could this be made to work for any ordering of idx values? I made my example too specific, sorry. – Griff Feb 19 '15 at 02:13
  • @Griff If they are not sequential then we can sort them first and sort the corresponding values as well (using `numpy.argsort`). This makes it `O(N log N)` in complexity; we could still do it in `O(N)` time in pure Python, but the extra for-loops may slow things down. – Ashwini Chaudhary Feb 19 '15 at 02:17
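
For completeness, here is a minimal sketch of the sorting step described in that last comment, assuming idx and values are numpy arrays (the variable names are illustrative; this is not part of the original answer):

order = np.argsort(idx)              # reorder so equal labels sit next to each other
idx_sorted = idx[order]              # illustrative names, not from the original answer
values_sorted = values[order]
indices = np.where(np.diff(idx_sorted) != 0)[0] + 1
medians = [np.median(a) for a in np.array_split(values_sorted, indices)]
result = np.dstack((np.unique(idx_sorted), medians))[0]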