32

Short Question

I have a large 10000x10000-element image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.

How do I extract the indices of each bin to efficiently perform my calculation using the bins' values?

What I am looking for is a solution that avoids the bottleneck of selecting `ind == j` from my large array every time. Is there a way to obtain directly, in one go, the indices of the elements belonging to every bin?

Detailed Explanation

1. Straightforward Solution

One way to achieve what I need is to use code like the following (see e.g. THIS related answer), where I digitize my values and then have a j-loop selecting the digitized indices equal to j, as below:

import numpy as np

# This function func() is just a placeholder for a much more complicated function.
# I am aware that my problem could be easily sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(int(1e8))  # the size argument must be an integer
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

result = [func(vals[ind == j]) for j in range(1, nbins)]

This is not what I want as it selects every time ind == j from my large array. This makes this solution very inefficient and slow.

2. Using binned_statistic

The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic, for the general case of a user-defined function. Using Scipy directly, an identical output can be obtained with the following:

import numpy as np
from scipy.stats import binned_statistic

vals = np.random.random(int(1e8))
# func as defined in the first example above
results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]

3. Using labeled_comprehension

Another Scipy alternative is to use scipy.ndimage.measurements.labeled_comprehension. Using that function, the above example would become:

import numpy as np
from scipy.ndimage import labeled_comprehension

vals = np.random.random(int(1e8))
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

result = labeled_comprehension(vals, ind, np.arange(1, nbins), func, float, 0)

Unfortunately, this form is also inefficient and, in particular, has no speed advantage over my original example.

4. Comparison with IDL language

To further clarify, what I am looking for is functionality equivalent to the REVERSE_INDICES keyword in the HISTOGRAM function of the IDL language (documented HERE). Can this very useful functionality be efficiently replicated in Python?

Specifically, using the IDL language the above example could be written as:

vals = randomu(s, 1e8)
nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)
result = dblarr(nbins)

for j=0, nbins-1 do begin
    jbins = r[r[j]:r[j+1]-1]  ; Selects indices of bin j
    result[j] = func(vals[jbins])
endfor

The above IDL implementation is about 10 times faster than the Numpy one, because the indices of the bins do not have to be re-selected for every bin, and the speed difference in favour of the IDL implementation increases with the number of bins.
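For clarity, the bookkeeping that REVERSE_INDICES performs corresponds roughly to the following NumPy sketch (a single argsort up front, then cheap slicing per bin; this is illustrative only, and doing it efficiently is precisely what I am asking for):

import numpy as np

# Illustrative sketch of IDL-style reverse indices in NumPy, with func as above
vals = np.random.random(int(1e8))
nbins = 100
ind = np.digitize(vals, np.linspace(0, 1, nbins+1))

order = np.argsort(ind)                                      # one sort, done once
boundaries = np.cumsum(np.bincount(ind, minlength=nbins+1))  # bin edges in `order`

# order[boundaries[j-1]:boundaries[j]] plays the role of r[r[j]:r[j+1]-1] in IDL
result = [func(vals[order[boundaries[j-1]:boundaries[j]]])
          for j in range(1, nbins)]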

divenex
  • I added a note about scipy.stats.binned_statistic, which is meant to address precisely my question, but uses the non-optimal approach. – divenex Nov 11 '14 at 19:01
  • For completeness I included an example using Scipy's labeled_comprehension, which is still inefficient – divenex Nov 16 '14 at 17:54

6 Answers

12

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic, but can be orders of magnitude faster:

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''

    N = len(values)
    r0, r1 = range

    # Assign each x to its (equal-width) bin with a single linear transformation
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)
    # Building the CSR matrix sorts the entries by row (i.e. by bin):
    # S.data then holds `values` grouped by bin, and S.indptr the bin boundaries
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
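For concreteness, a usage sketch matching the question's setup (with func and vals as defined in the question):

vals = np.random.random(int(1e8))
res = binned_statistic(vals, vals, func, nbins=100, range=(0, 1))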

  • Another clever answer! This is the fastest answer so far. With the usual settings as in my original example, the first solution takes 14.1s. Interestingly a line_profiler analysis indicates that 98% of the time is spent in the sort() line. Thus, the speed of this approach, like @Giulio's one, is nearly independent of the number of bins as required (14.2s with nbins=1000). This solution should replace the one in scipy.stats.binned_statistic... – divenex Nov 12 '14 at 15:08
  • Can one generalize this approach to the case I actually need, where the selection variable `x` is not the same as the array `values` from which one needs to compute the statistic, as in `scipy.stats.binned_statistic(x, values)`? In that case it seems indices are actually needed. – divenex Nov 12 '14 at 16:16
  • @divenex: Indeed, for that case you'll need to get indices of some kind. The two methods in my answer can be modified to use `argsort` and `argpartition` respectively, but that will seriously worsen the run time. The first will become equivalent to Giulio's answer. –  Nov 13 '14 at 08:37
  • 1
    @divenex: I added a new method to my answer; I'm curious how it compares to your IDL version. –  Nov 14 '14 at 15:00
  • Absolutely brilliant! – divenex Nov 15 '14 at 23:34
  • The last solution by moarningsun, using csr_matrix, finally manages to match (i) the speed, (ii) scalability and (iii) generality of the IDL implementation using REVERSE_INDICES. I measured 5.1s (as in IDL) with nbins=100 and still just 7.6s with nbins=1000. The improvement is quite important as it could be used to rewrite the logic of both scipy.stats.binned_statistic and the related scipy.ndimage.labeled_comprehension for the general task of applying a function to a set of blobs. Thank you very much moarningsun! – divenex Nov 15 '14 at 23:44
  • Importantly, moarningsun's solution still produces *identical* results to my original slow solution, for all 16 digits, when using the very same input. – divenex Nov 15 '14 at 23:47
  • @divenex: In terms of rewriting stuff for Scipy I think it would be better if `numpy.histogram` would have the option to return indices just like in your IDL example. In addition, `numpy.histogram` and `numpy.digitize` would benefit from having fast code paths for when the bins are uniform. Also, my approach with the linear transformation doesn't regard the last bin as a closed interval, which could be a significant defect for some data sets. –  Nov 17 '14 at 10:08
  • I assume that in this answer, `values` is the `vals` from the question, that `nbins` and `func` are the same as in the question, and that `range` is `(0,1)` as per the `np.linspace` line of the question. However, what is the `x`? I can't get this answer to work using any of the assumptions I've made about what it might be! – John Coxon Apr 18 '16 at 10:01
  • 1
    @JohnCoxon - It should accept more or less the same arguments as [`scipy.stats.binned_statistic`](http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.binned_statistic.html) –  Apr 19 '16 at 12:08
  • @user2379410 Could you, please, provide a modification for multivariate input data, `x.shape==(N, D)`? In other words, how to make `np.histogramdd` return indices? – dizcza Feb 24 '18 at 15:12
4

I assume that the binning, done in the example with digitize, cannot be changed. This is one way to go, where you do the sorting once and for all:

import numpy as np
import matplotlib.pyplot as plt

vals = np.random.random(int(1e4))
nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

new_order = np.argsort(ind)
ind = ind[new_order]
ordered_vals = vals[new_order]
# slower way of calculating first_hit (first version of this post):
# _, first_hit = np.unique(ind, return_index=True)
# faster way:
first_hit = np.searchsorted(ind, np.arange(1, nbins+1))
first_hit.sort()

# example of using the data:
for j in range(nbins-1):
    # a plotting function stands in for func, to show that the values cluster
    plt.plot(ordered_vals[first_hit[j]:first_hit[j+1]], 'o')

The figure (not reproduced here) shows that the bins are actually clusters, as expected.
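To compute the result instead of plotting, the same slices feed func directly (this is the variant timed in the comments below):

result = [func(ordered_vals[first_hit[j]:first_hit[j+1]])
          for j in range(nbins-1)]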

gg349
  • This is an elegant way to keep the index selection outside the for loop, as needed. Unfortunately, with my required number of input values and bins, the computation time is slightly slower than my original implementation above. Adding a couple of clock() commands I measure 42.4s for my implementation and 49.5s for @giulio-ghirardo's answer. Note that the IDL version takes just 5.0s with identical settings to my Python version, and still just 5.9s with nbins=1000, while the calculation time of my Python version scales nearly linearly with nbins. – divenex Nov 07 '14 at 11:31
  • In my test I obviously removed the plot and replaced it with `result = [func(ordered_vals[first_hit[j]:first_hit[j+1]]) for j in range(nbins-1)]` – divenex Nov 07 '14 at 11:43
  • +1 I would have solved it in a pretty similar way. Still weird that this is slower than the original 'naive' solution for some sizes. Maybe the `nbin` times brute-force comparison of a vector of length `n` is faster than all the sorting you do (which should scale with `n*log(n)`) when `nbin` is small enough? – Bas Swinckels Nov 07 '14 at 22:38
  • A line-by-line profiling indicates that 52% of the time for the above solution is spent in argsort() and 20% in unique(). As expected the "result =" line contributes a negligible 0.8%. – divenex Nov 11 '14 at 09:32
  • I have changed the solution, so that now `searchsorted` is used instead of `unique`. This makes the calculation a bit faster, but as @divenex points out the big bottleneck is with `argsort`. In reply to @Bas Swinckels: I would expect this solution to be faster when the number of bins `nbins` is larger, because in the original solution the bottleneck is the operation `vals[ind == j]`, which would be evaluated more often. – gg349 Nov 11 '14 at 15:14
3

You can halve the computation time by sorting the array first and then using np.searchsorted:

import numpy as np

vals = np.random.random(int(1e8))
vals.sort()

nbins = 100
bins = np.linspace(0, 1, nbins+1)
ind = np.digitize(vals, bins)

results = [func(vals[np.searchsorted(ind,j,side='left'):
                     np.searchsorted(ind,j,side='right')])
           for j in range(1,nbins)]

Using 1e8 as my test case, I go from 34 seconds of computation to about 17.

Hooked
  • This is also a very good answer. Indeed I confirm that I go from the 42.7s of my version above to 20.6s with Hooked's version, with identical output. Interestingly the computation time of Hooked's version does not scale as badly as mine when increasing nbins. I measured 53.5s with nbins=1000. Still I feel there should be a way to approach, with Numpy, the IDL efficiency (5.9s with nbins=1000) on this rather general task. So I still wait for a definitive answer to my question. – divenex Nov 07 '14 at 11:56
  • 1
    @divenex I think it scales better because it is an algorithmic change. The sorting should be done in `O(n ln n)` time, while the old double-loop method should be `O(n^2)`. – Hooked Nov 07 '14 at 15:02
2

2023 update: New Scipy value_indices

For the record, eight years after my original question, Scipy 1.10 (released in January 2023) introduced a new function, scipy.ndimage.value_indices, which does exactly what I asked for in my question. The documentation even explicitly mentions that they aimed to emulate the IDL functionality:

Note for IDL users: this provides functionality equivalent to IDL’s REVERSE_INDICES option (as per the IDL documentation for the HISTOGRAM function).

Using the new Scipy function, the equivalent of the function suggested in the accepted answer would be the following:

import numpy as np
from scipy.ndimage import value_indices

def binned_statistic(x, values, func, nbins, extent):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''

    r0, r1 = extent
    # Assign each x to its (equal-width) bin with a single linear transformation
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)

    # Dictionary mapping each bin label to the indices of its elements
    ind = value_indices(digitized)

    return [func(values[j]) for j in ind.values()]

This function can be used as follows:

import numpy as np

# This function func() is just a placeholder for a much more complicated function.
# I am aware that my problem could be easily sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(int(1e8))
nbins = 100
extent = [0, 1]
res = binned_statistic(vals, vals, func, nbins, extent)

I timed the new function against the currently accepted answer and found it to have comparable speed on the given example, though it is about 1.7 times slower. For this reason it is not obvious that this should become the accepted answer, as the relative efficiency depends on the size of the problem.

divenex
1

One efficient solution is using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi

# ind and vals as defined in the question; returns a list of per-bin arrays
npi.group_by(ind).split(vals)
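A generic function can also be applied per group in a single call; this usage is taken from the comments below rather than verified against the package documentation:

import numpy_indexed as npi

# per the comments below: returns the unique bin labels and the per-bin reduction
unique_ind, result = npi.group_by(ind, vals, func)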
Eelco Hoogendoorn
  • Thanks. Could you please provide a minimally working example answering the above question using your numpy_indexed package? – divenex Apr 05 '16 at 09:59
  • In other words, could you edit your answer by showing how to efficiently apply a generic `func` to produce `result`, as in the other examples? – divenex Apr 05 '16 at 10:32
  • You can pass reduce=func to group_by, as in npi.group_by(ind, vals, func); I don't expect it to be any faster than the currently accepted answer though; just more tidy – Eelco Hoogendoorn Apr 05 '16 at 11:12
  • this is what group_by does internally with your func parameter: return [(key, reduction(group)) for key, group in zip(g.unique, groups)] – Eelco Hoogendoorn Apr 05 '16 at 11:14
  • I edited this answer to explicitly address the question. Unfortunately the `numpy_indexed` package is much slower than the accepted answer on this example. I measured 32s with `numpy_indexed`, compared to 5s (!) for the accepted answer using `csr_matrix` and 65s using `binned_statistic` in the way described in the original question. In all three cases the results are identical in every digit. – divenex Apr 05 '16 at 13:41
  • The actual command should be `keys, result = np.array(npi.group_by(ind, values=vals, reduction=func)).T`. @Eelco it would be useful for you to write this explicitly in your answer, assuming this is your intended usage. – divenex Apr 05 '16 at 15:03
  • There are various ways of addressing this problem, depending on the exact output format you want, which is something your question does not specify. If you insist on calling a custom python function on your groups, full vectorization goes out of the window, so I don't see the added value of your cast to array; in fact, it is liable to cause trouble with dtypes between keys and result, so zipping would be preferred here. Indeed one may expect the accepted answer to be faster; thanks for benchmarking it! – Eelco Hoogendoorn Apr 05 '16 at 16:24
0

Pandas has very fast grouping code (I think it's written in C), so if you don't mind loading the library, you could do this:

import pandas as pd

# vals and ind as defined in the question
pdata = pd.DataFrame({'vals': vals, 'ind': ind})
resultsp = pdata.groupby('ind').sum().values

or, more generally:

pdata = pd.DataFrame({'vals': vals, 'ind': ind})
resultsp = pdata.groupby('ind').agg(func).values

Note that the latter is slower for standard aggregation functions (like sum, mean, etc.).

Georgy
  • 3
    I timed it on my original example above. This is slightly slower than the "Straightforward Solution" in my question, and about 13x slower than the accepted answer. – divenex Feb 22 '18 at 15:00