9

I went through several questions on StackOverflow but could not find the relevant answer. I want to fetch indices of k maximum values from a numpy ndarray. This link discusses the same but for 1D array. np.argsort for 2D array resulted into sorting of elements row-wise . i.e

Note: array elements are not unique.

input:

import numpy as np
n = np.arange(9).reshape(3,3)
>>> n
array([[0, 1, 2],
   [3, 4, 5],
   [6, 7, 8]])
s = n.argsort()
>>> s
array([[0, 1, 2],
   [0, 1, 2],
   [0, 1, 2]], dtype=int32)

Also,

import numpy as np
n = np.arange(9).reshape(3,3)
s = n.argsort(axis=None)
>>>s
array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=int32)

but I am loosing the array structure here and can not redeem the original indices of the elements.

Any knid of help is appreciated.

Community
  • 1
  • 1
Rashmi Singh
  • 519
  • 1
  • 8
  • 20

1 Answers1

12

Couple of approaches with np.argpartition and np.argsort for ndarrays -

def k_largest_index_argpartition_v1(a, k):
    idx = np.argpartition(-a.ravel(),k)[:k]
    return np.column_stack(np.unravel_index(idx, a.shape))

def k_largest_index_argpartition_v2(a, k):
    idx = np.argpartition(a.ravel(),a.size-k)[-k:]
    return np.column_stack(np.unravel_index(idx, a.shape))

def k_largest_index_argsort(a, k):
    idx = np.argsort(a.ravel())[:-k-1:-1]
    return np.column_stack(np.unravel_index(idx, a.shape))

Discussion on two versions with argpartition

Difference between k_largest_index_argpartition_v1 and k_largest_index_argpartition_v2 is how we are using argparition. In the first version, we are negating input array and then using argpartition to get the indices for the smallest k indices, thus effectively getting the largest k indices, whereas in the second version, we are getting the first a.size-k smallest indices and then we are choosing the leftover largest k indices.

Also, its worth mentioning here that with argpartition, we are not getting the indices in their sorted order. If the sorted order is needed, we need to feed in range array to np.argpartition, as mentioned in this post.

Sample runs -

1) 2D case :

In [42]: a    # 2D array
Out[42]: 
array([[38, 14, 81, 50],
       [17, 65, 60, 24],
       [64, 73, 25, 95]])

In [43]: k_largest_index_argsort(a, k=2)
Out[43]: 
array([[2, 3],
       [0, 2]])

In [44]: k_largest_index_argsort(a, k=4)
Out[44]: 
array([[2, 3],
       [0, 2],
       [2, 1],
       [1, 1]])

In [66]: k_largest_index_argpartition_v1(a, k=4)
Out[66]: 
array([[2, 1], # Notice the order is different
       [2, 3],
       [0, 2],
       [1, 1]])

2) 3D case :

In [46]: a # 3D array
Out[46]: 
array([[[20, 98, 27, 73],
        [33, 78, 48, 59],
        [28, 91, 64, 70]],

       [[47, 34, 51, 19],
        [73, 38, 63, 94],
        [95, 25, 93, 64]]])

In [47]: k_largest_index_argsort(a, k=2)
Out[47]: 
array([[0, 0, 1],
       [1, 2, 0]])

Runtime test -

In [56]: a = np.random.randint(0,99999999999999,(3000,4000))

In [57]: %timeit k_largest_index_argsort(a, k=10)
1 loops, best of 3: 2.18 s per loop

In [58]: %timeit k_largest_index_argpartition_v1(a, k=10)
10 loops, best of 3: 178 ms per loop

In [59]: %timeit k_largest_index_argpartition_v2(a, k=10)
10 loops, best of 3: 128 ms per loop
Community
  • 1
  • 1
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Just for clarity, `argpartition` won't give the top `k` in order, just the top `k` in their initial order. A way to have the best of both worlds (speed of `argpartition`, order of `argsort`) is to sort after the partition: do `idx2=np.argsort(-a.ravel()[idx])` and then `row,col = np.unravel.index(idx[idx2], a.shape)` – Daniel F Apr 13 '17 at 09:00
  • Well that's easier! – Daniel F Apr 13 '17 at 09:49
  • "As such, the first version would be better" - why does it matter whether you take the first or last `k`? Argpartition doesn't sort the first `k`, does it? – Eric Apr 13 '17 at 10:28
  • Also, the cost of doing `-a.ravel()` is somewhat significant here – Eric Apr 13 '17 at 10:37
  • In fact, for any `k not in [1,2]`, the second version is consistently faster - so you just got lucky with your choice of small `k` – Eric Apr 13 '17 at 10:39
  • @Eric Exactly, I found out something peculiar that for `k=2`, `argpartition` runtime is a small number, but for `3` onwards, it shoots up and stays more or less constant thereafter. Thus, if we do : `a = np.random.randint(0,99999999999999,(10000000))` and then `np.argpartition(a,2)` comes at `39msec`, whereas with `k=3`, gets to `90.6 ms `. – Divakar Apr 13 '17 at 10:42
  • @Eric So, I made a wrong assumption there about `argpartition`, removed that part. Thanks for pointing that out! – Divakar Apr 13 '17 at 10:53
  • This is so good, I have to cite it. Did you want credit in any other way besides a link here? – mLstudent33 Nov 05 '19 at 08:27
  • @mLstudent33 Citation where exactly? – Divakar Nov 05 '19 at 08:30
  • In my masters degree assignment so no real publicity unfortunately. I just pasted th link directly above the code snippet where I use it. – mLstudent33 Nov 05 '19 at 08:32
  • 1
    @mLstudent33 So, what I do when I am quoting codes from elsewhere on here, I put the link and the author name with `@` on top of the function code, something like in here - https://stackoverflow.com/a/51951787/. So, I guess that could be used in assignments/outside of Stackoverflow too. – Divakar Nov 05 '19 at 08:40