2

I have an array with 1M floats and want to get the top 10. Wondering if there's an argsort that will only give top 10 to speed things up. I suppose I could do np.argmax() 10 times, dropping the resulting index each time (probably by making it -inf, to not have to keep track of shifting indices). Just checking if there's a well-known inbuilt way.

Doesn't necessarily have to be numpy. Second best is another very common library.

Alexander Soare
  • 2,825
  • 3
  • 25
  • 53
  • Why not just use np.argsort()? – sehan2 Aug 01 '21 at 16:04
  • @sehan2 because I don't want to compute the whole sort. I presume it would be a LOT faster to just get the top 10 – Alexander Soare Aug 01 '21 at 16:05
  • for the first 10 you also have to look at all your floats aswell – sehan2 Aug 01 '21 at 16:06
  • There are some sorting algorithms (insertion sort, selection sort, etc.) that can efficiently find the "largest" elements first, but for the real-world efficient sorting algorithms (quicksort, merge sort, etc.), there's not really an obvious "short-circuit" mechanism where you could stop after sorting part of the list. It's a good question, but to benefit from the short-circuiting, you'd likely have to switch to a less efficient sorting algorithm, and at that point it's possible you've lost more than you've gained. – Silvio Mayolo Aug 01 '21 at 16:07
  • @sehan2 I won't profess to know a lot about sorting algorithms, but my general knowledge tells me that there's more to it than just looking at all the floats. Maybe someone will correct me. – Alexander Soare Aug 01 '21 at 16:09
  • 1
    In CS this is known as partial sorting (which should help your googling). You might be able to leverage [`numpy.partition()`](https://numpy.org/doc/stable/reference/generated/numpy.partition.html) for this somehow. – Ari Cooper-Davis Aug 01 '21 at 16:11
  • 2
    [```numpy.argpartition```](https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html) Is the closest thing I know of in numpy. I believe you can do ```arr.argpartition(kth=(0,1,2,3,4,5,6,7,8,9))[:10]``` to get the index of the top 10 elements (in their sorted order) in the array ```arr```. – Kevin Aug 01 '21 at 16:21
  • There is also [`heapq.nlargest`](https://docs.python.org/3/library/heapq.html#heapq.nlargest) – Stef Aug 01 '21 at 16:32
  • @Stef although I don't think you can get the indices using `heapq.nlargest()` without additional processing? – Ari Cooper-Davis Aug 01 '21 at 16:39
  • 1
    @AriCooper-Davis Fair enough. Although the additional processing is as simple as replacing each element by the pair `(element, index)` in the initial list. – Stef Aug 01 '21 at 21:47

1 Answers1

2

You can indeed use numpy.argpartition():

import numpy
a = numpy.random.rand(1000000)
b = numpy.argpartition(a, [0,1,2,3,4,5,6,7,8,9])[:10]

It's nearly 10 times faster (in this case, on my machine):

>>> import timeit, numpy
>>> a = numpy.random.rand(1000000)
>>> # Sorting traditionally
>>> timeit.timeit('numpy.argsort(a)[:10]', globals=globals(), number=100)
10.2261...
>>> # Getting the top 10 but not in order
>>> timeit.timeit('numpy.argpartition(a, 10)[:10]', globals=globals(), number=100)
1.2125...
>>> # Getting the top 10 in order
>>> timeit.timeit('numpy.argpartition(a, [0,1,2,3,4,5,6,7,8,9])[:10]', globals=globals(), number=100)
1.4764...
Ari Cooper-Davis
  • 3,374
  • 3
  • 26
  • 43