
I wrote some Python code that takes a NumPy matrix as input and returns a list of indices grouped by the corresponding values (i.e. output[3] returns all indices with a value of 3). However, I don't know how to write vectorized code, so I did it using ndenumerate. This operation took about 9 seconds, which is too slow.
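
Roughly, the ndenumerate version looks like this (a sketch; the names are illustrative):

import numpy as np

# One Python-level loop over every element: simple, but slow for large matrices.
groups = {}
for idx, value in np.ndenumerate(input_matrix):   # input_matrix: the 2D input
    groups.setdefault(value, []).append(idx)      # idx is a (row, col) tuple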

The second idea that I had was using numpy.nonzero as follows:

for i in range(1, max_value):
    current_array = np.nonzero(input == i)
    # save in an array

This took 5.5 seconds, so it was a good improvement, but it is still slow. Is there a way to do this without loops, or an otherwise optimized way to get the pairs of indices per value?

Sami
  • To find all nonzero elements in an array, by necessity you have to check all the elements. So you can't really do "without loops". – Luigi Apr 18 '14 at 18:28
  • @Luigi While you need to loop over the values, if you can avoid an explicit Python loop, which is very slow, and replace it with an implicit, vectorized NumPy loop, the speed-up is generally a couple of orders of magnitude. – Jaime Apr 18 '14 at 20:08
  • FWIW, two-thirds of the time people are looking for a `groupby` operation for numpy arrays they'd be better off with a pandas `DataFrame`. It's impossible to tell which category you're in without information about the actual problem being solved, though. – DSM Apr 18 '14 at 20:32
  • Thank you guys for the comments. Regarding the data, it is a 2D matrix with values ranging from 0 to max. I know the value of max, and the matrix contains all the values between 1 and max (I don't care about the zeros here). The values correspond to a previous clustering operation, so I have no clue about the maximum number of indices to expect, since the input of the clustering operation is a user file. The two provided solutions cover what I need to solve my problem. – Sami Apr 18 '14 at 20:42

2 Answers


Here's an O(n log n) algorithm for your problem. The obvious looping solution is O(n), so for sufficiently large datasets this will be slower:

>>> a = np.random.randint(3, size=10)
>>> a
array([1, 2, 2, 0, 1, 0, 2, 2, 1, 1])

>>> index = np.arange(len(a))
>>> sort_idx = np.argsort(a)
>>> cnt = np.bincount(a)
>>> np.split(index[sort_idx], np.cumsum(cnt[:-1]))
[array([3, 5]), array([0, 4, 8, 9]), array([1, 2, 6, 7])]
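
Since the split arrays come out in value order (0, 1, 2, ...), one way to get the output[value] lookup described in the question is to index the resulting list directly, or wrap it in a dict (a sketch; lookup is an illustrative name):

>>> groups = np.split(index[sort_idx], np.cumsum(cnt[:-1]))
>>> lookup = dict(enumerate(groups))  # lookup[v] -> indices where a == v
>>> lookup[2]
array([1, 2, 6, 7])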

It will depend on the size of your data, but it is reasonably fast for largish data sets:

In [1]: a = np.random.randint(1000, size=10**6)

In [2]: %%timeit
   ...: indices = np.arange(len(a))
   ...: sort_idx = np.argsort(a)
   ...: cnt = np.bincount(a)
   ...: np.split(indices[sort_idx], np.cumsum(cnt[:-1]))
   ...: 
10 loops, best of 3: 140 ms per loop
Jaime
  • This is extremely fast; in my case it took 40 ms with the same data. Thanks a lot. One small question: any chance of extending this method to 2D without doing ravel() or reshape()? The 40 ms includes the reshaping, but it would be nice to avoid unnecessary memory usage. Thanks again Jaime – Sami Apr 18 '14 at 20:35
  • Because it uses sorting, it needs a flattened array. But ravelling and reshaping are in most cases very cheap operations, as they do not copy any data around; they just change the way it is viewed (see the sketch after these comments). So I do not think you should be worried about a missed optimization. – Jaime Apr 19 '14 at 21:03
  • Perfect then, I didn't know that it does not clone the data. Thanks a lot Jaime for the assistance. – Sami Apr 20 '14 at 08:25
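
A minimal sketch of the same argsort/bincount recipe applied to a small 2D matrix (illustrative names; kind='mergesort' is a stable sort, so each group comes out in ascending flat-index order, and np.unravel_index converts a group of flat indices back into (row, col) arrays in the same format np.nonzero returns):

>>> mat = np.array([[1, 2, 2], [0, 1, 0], [2, 2, 1]])
>>> flat = mat.ravel()                       # a view for a contiguous array, no copy
>>> sort_idx = np.argsort(flat, kind='mergesort')
>>> cnt = np.bincount(flat)
>>> flat_groups = np.split(np.arange(flat.size)[sort_idx], np.cumsum(cnt[:-1]))
>>> np.unravel_index(flat_groups[2], mat.shape)   # (rows, cols) where mat == 2
(array([0, 0, 2, 2]), array([1, 2, 0, 1]))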

If you're willing to use some extra memory, you can vectorize by broadcasting:

import numpy as np

max_value = 10  # example value; use the actual maximum from your data
input = np.random.randint(1, max_value, 100)   # stand-in for your matrix's values
indices = np.arange(1, max_value)

matches = input == indices[:, np.newaxis]  # broadcasts: one boolean row per value

Then, the positions holding the value indices[i] (i.e. i + 1) are simply np.nonzero(matches[i]).
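
One way to collect everything into the output[value] lookup the question describes (a sketch; groups is an illustrative name):

# groups[v] holds the positions where input == v
groups = {v: np.nonzero(row)[0] for v, row in zip(indices, matches)}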

perimosocordiae