If you don't mind a dependency on scipy, the function scipy.ndimage.labeled_comprehension can do this. Here's an example.
First set up the sample data:
In [570]: import numpy as np
In [571]: idx = np.array([0,0,0,1,1,1,2,2,2,3,3,3,4,4,5,5])
In [572]: values = np.array([1.2,3.1,3.1,3.1,3.3,1.2,3.3,4.1,5.4,6,6,6.2,6,7,7.2,7.2])
Get the unique "labels" in idx. (If you already know the maximum label is, say, N, and you know that all the integers from 0 to N are used, you could use uniq = range(N+1) instead.)
In [573]: uniq = np.unique(idx) # Or range(idx.max()+1)
In [574]: uniq
Out[574]: array([0, 1, 2, 3, 4, 5])
Use labeled_comprehension to compute the median of each labeled group:
In [575]: from scipy.ndimage import labeled_comprehension
In [576]: medians = labeled_comprehension(values, idx, uniq, np.median, np.float64, None)
In [577]: medians
Out[577]: array([ 3.1, 3.1, 4.1, 6. , 6.5, 7.2])
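If a mapping from label to median is more convenient than a bare array, the labels and medians can be zipped together. A minimal self-contained sketch of the same computation (the dict name by_label is just for illustration):

```python
import numpy as np
from scipy.ndimage import labeled_comprehension

idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5])
values = np.array([1.2, 3.1, 3.1, 3.1, 3.3, 1.2, 3.3, 4.1,
                   5.4, 6, 6, 6.2, 6, 7, 7.2, 7.2])

uniq = np.unique(idx)
medians = labeled_comprehension(values, idx, uniq, np.median, np.float64, None)

# Pair each label with its median for easy lookup.
by_label = dict(zip(uniq.tolist(), medians.tolist()))
# by_label[4] is 6.5 (the median of [6, 7])
```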
Another option, if you don't mind the dependency on pandas, is to use the groupby method of the pandas.DataFrame class.
Set up the DataFrame:
In [609]: import pandas as pd
In [610]: df = pd.DataFrame(dict(labels=idx, values=values))
In [611]: df
Out[611]:
labels values
0 0 1.2
1 0 3.1
2 0 3.1
3 1 3.1
4 1 3.3
5 1 1.2
6 2 3.3
7 2 4.1
8 2 5.4
9 3 6.0
10 3 6.0
11 3 6.2
12 4 6.0
13 4 7.0
14 5 7.2
15 5 7.2
Use groupby to group the data by the labels column, and then compute the median of each group:
In [612]: result = df.groupby('labels').median()
In [613]: result
Out[613]:
values
labels
0 3.1
1 3.1
2 4.1
3 6.0
4 6.5
5 7.2
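If you need the result as a plain NumPy array rather than a DataFrame, you can select the column before aggregating and then convert; a short sketch of that variant:

```python
import numpy as np
import pandas as pd

idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5])
values = np.array([1.2, 3.1, 3.1, 3.1, 3.3, 1.2, 3.3, 4.1,
                   5.4, 6, 6, 6.2, 6, 7, 7.2, 7.2])

df = pd.DataFrame(dict(labels=idx, values=values))

# Selecting the 'values' column first yields a Series; .to_numpy()
# then gives a 1-D array, in sorted label order (groupby sorts by default).
medians = df.groupby('labels')['values'].median().to_numpy()
# medians is array([3.1, 3.1, 4.1, 6. , 6.5, 7.2])
```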
Disclaimer: I haven't tried either of those suggestions on large arrays, so I don't know how their performance will compare with your brute force solution or with @Ashwini's answer.
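If you do want to compare them, a rough benchmark skeleton along these lines could be used (the array sizes here are arbitrary assumptions, and no timings are claimed):

```python
import timeit
import numpy as np
import pandas as pd
from scipy.ndimage import labeled_comprehension

rng = np.random.default_rng(0)
n = 100_000
idx = rng.integers(0, 1000, size=n)   # assumed sizes, purely illustrative
values = rng.random(n)
uniq = np.unique(idx)

def via_scipy():
    return labeled_comprehension(values, idx, uniq, np.median, np.float64, None)

def via_pandas():
    # groupby sorts labels by default, matching np.unique's order.
    return pd.DataFrame(dict(labels=idx, values=values)) \
             .groupby('labels')['values'].median().to_numpy()

# Sanity check: both approaches should agree before timing them.
assert np.allclose(via_scipy(), via_pandas())

# Numbers will vary by machine, array size, and number of groups.
for name, fn in [('scipy', via_scipy), ('pandas', via_pandas)]:
    print(name, timeit.timeit(fn, number=3))
```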