267

Consider the following code:

import numpy as np

avgDists = np.array([1, 8, 6, 9, 4])
ids = avgDists.argsort()[:n]

This gives me the indices of the n smallest elements. Is it possible to use the same argsort in descending order to get the indices of the n highest elements?

Shaido
  • 27,497
  • 23
  • 70
  • 73
shn
  • 5,116
  • 9
  • 34
  • 62
  • 4
    Isn't it simply `ids = np.array(avgDists).argsort()[-n:]`? – Jaime May 10 '13 at 16:50
  • 3
    @Jaime: No, that doesn't work. 'right answer' is `[3, 1, 2]`. Your line produces `[2, 1, 3]` (if n==3 as an example) – dawg May 11 '13 at 03:01
  • 2
    @drewk Well, then make it `ids = np.array(avgDists).argsort()[-n:][::-1]`. The thing is avoiding making a copy of the whole list, which is what you get when you add a `-` in front of it. Not relevant for the OP's small example, could be for larger cases. – Jaime May 11 '13 at 14:46
  • 1
    @Jaime: You are right. See my updated answer. The syntax though is just the opposite of your comment's ending slice: `np.array(avgDists).argsort()[::-1][:n]` will do it. Also, if you are going to use numpy, stay in numpy. First convert the list to an array: `avgDist=np.array(avgDists)`, then it becomes `avgDist.argsort()[::-1][:n]` – dawg May 11 '13 at 18:33

10 Answers

346

If you negate an array, the lowest elements become the highest elements and vice-versa. Therefore, the indices of the n highest elements are:

(-avgDists).argsort()[:n]

Another way to reason about this, as mentioned in the comments, is to observe that the biggest elements come last in the argsort output. So, you can read from the tail of the argsort to find the n highest elements:

avgDists.argsort()[::-1][:n]

Both methods are O(n log n) in time complexity, because the argsort call is the dominant term here. But the second approach has a nice advantage: it replaces an O(n) negation of the array with an O(1) slice. If you're working with small arrays inside loops then you may get some performance gains from avoiding that negation, and if you're working with huge arrays then you can save on memory usage because the negation creates a copy of the entire array.

Note that these methods do not always give equivalent results: if a stable sort is requested from argsort, e.g. by passing the keyword argument kind='mergesort', then the first strategy preserves sorting stability, but the second strategy breaks stability (i.e. the positions of equal items get reversed).
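For example, here is a small demonstration with kind='mergesort' on an array containing tied values (the 2s at indices 1 and 2):

>>> a = np.array([1, 2, 2, 3])
>>> (-a).argsort(kind='mergesort')[:2]  # stable: tied 2s keep original order
array([3, 1])
>>> a.argsort(kind='mergesort')[::-1][:2]  # tied 2s come out reversed
array([3, 2])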

Example timings:

Using a small array of 100 floats and a length-30 tail, the view method was about 15% faster:

>>> avgDists = np.random.rand(100)
>>> n = 30
>>> timeit (-avgDists).argsort()[:n]
1.93 µs ± 6.68 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
1.64 µs ± 3.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
1.64 µs ± 3.66 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

For larger arrays, the argsort is dominant and there is no significant timing difference:

>>> avgDists = np.random.rand(1000)
>>> n = 300
>>> timeit (-avgDists).argsort()[:n]
21.9 µs ± 51.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[::-1][:n]
21.7 µs ± 33.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> timeit avgDists.argsort()[-n:][::-1]
21.9 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Please note that the comment from nedim below is incorrect. Whether to truncate before or after reversing makes no difference in efficiency, since both of these operations are only striding a view of the array differently and not actually copying data.
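You can verify this yourself through the .base attribute (a non-None .base means the array is a view of another array's data); a quick sketch:

>>> idx = avgDists.argsort()
>>> idx[::-1][:n].base is not None  # truncate after reversing: a view
True
>>> idx[-n:][::-1].base is not None  # truncate before reversing: also a view
True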

wim
  • 338,267
  • 99
  • 616
  • 750
  • 15
    It is even more efficient to slice before reversing, i.e., ``np.array(avgDists).argsort()[:-n][::-1]`` – nedim Jul 16 '15 at 09:06
  • 3
    These answers are not equivalent if the original array contains nans. In such a case, the first solution seems to give the more natural result with nans at the end rather than at the beginning. – feilchenfeldt Mar 13 '17 at 12:23
  • 1
    How do these compare when a stable sort is desired? Presumably the slicing strategy reverses equal items? – Eric Aug 06 '17 at 18:28
  • @Eric It depends on the sort implementation chosen (that's a keyword argument `kind` accepted by `argsort`). The default kind is 'quicksort' which makes no guarantee to be stable anyway. If you use a 'mergesort', then you will get a stable sort for the first strategy, and reversal (anti-stability?) for the second strategy. – wim Aug 06 '17 at 19:13
  • @wim Would you mind to expand on a reason, why your recent mod has erased a note on a negation operation, "*(which creates a copy)*" – user3666197 Aug 06 '17 at 19:25
  • 1
    @user3666197 I felt it was not relevant to the answer. Whether the negation creates a copy or not (it does) isn't really important here, the relevant information is that calculating the negation is *O(n)* complexity vs taking another slice which is *O(1)*. – wim Aug 06 '17 at 19:33
  • Thanks for a kind response. IMHO the [SPACE]-domain related troubles are worth not being neglected, even if one emphasises [TIME]-domain of an approach. Given a copy has to be created ( which it sure does have to, unless in-place mode is sometimes possible ) both the large-scale and small-scale computing suffers as on both ends a remarkable [TIME]-domain cost is introduced right by the [SPACE]-domain requirement ( memory allocs + transfers + all that associated cache-misses and many more ). So [SPACE] matters the same as [TIME] if drilling into achieving an ultimate performance. Thanks again. – user3666197 Aug 06 '17 at 19:52
  • 1
    @user3666197 Yes, that's a good point - if an array is taking up 50% available memory, we will certainly want to avoid copying it and causing swapping. I'll edit again to mention that a copy is created there. – wim Aug 06 '17 at 20:41
  • Glad you decided to add this view. While I do not have quantitative evidence at hand, even the small-[SPACE]-sized calculations benefit ( a lot, if keeping (process) & (overheads) & (avoidable overheads) in mind and in respective proportion of the resulting process-latency / -performance ) from excluding a copy -- i.e. all avoidable overheads associated with ( even small ) [SPACE]-operations. Benchmarks will bring hard data into this ( **even nanoseconds do count :o)** ), so I'd dare to limit the applicability to scales that introduce **memIO+ fileIO** swap penalties in large hundreds of [ms]. – user3666197 Aug 06 '17 at 21:08
  • 2
    @nedim comment should be `[-n:]` instead of `[:-n]` – Girardi Apr 25 '20 at 00:49
  • 2
    @nedim Comment is misleading/incorrect, I think. A similar claim may be true for lists (slice is a copy) but the evidence does not support this for numpy arrays (slice is a view). – wim Apr 27 '20 at 16:38
  • Why not `[:-n-1:-1]`? – syockit Mar 13 '21 at 05:11
  • @nedim `[::-1][:n]` is **not** equivalent to `[:-n][::-1]`. It should be `[-n:][::-1]`. – Nermin Nov 10 '22 at 17:30
  • @nedim Aren't they both views? Shouldn't they have the exact same performance? (+/- a nanosecond) – Mateen Ulhaq Jul 13 '23 at 09:03
97

Just as with a Python list, [::-1] reverses the array returned by argsort(), and [:n] then takes the first n entries of the reversed array, i.e. the indices of the n highest values:

>>> avgDists=np.array([1, 8, 6, 9, 4])
>>> n=3
>>> ids = avgDists.argsort()[::-1][:n]
>>> ids
array([3, 1, 2])

The advantage of this method is that ids is a view of the array returned by argsort, not a copy:

>>> ids.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

(The 'OWNDATA' being False indicates this is a view, not a copy)

Another way to do this is something like:

(-avgDists).argsort()[:n]

The problem is that this works by creating the negative of each element in the array:

>>> (-avgDists)
array([-1, -8, -6, -9, -4])

And it creates a copy to do so:

>>> (-avgDists).flags['OWNDATA']
True

So if you time each, with this very small data set:

>>> import timeit
>>> timeit.timeit('(-avgDists).argsort()[:3]', setup="from __main__ import avgDists")
4.2879798610229045
>>> timeit.timeit('avgDists.argsort()[::-1][:3]', setup="from __main__ import avgDists")
2.8372560259886086

The view method is substantially faster (and uses half the memory).

borgr
  • 20,175
  • 6
  • 25
  • 35
dawg
  • 98,345
  • 23
  • 131
  • 206
  • 4
    This answer is good, but I feel your wording misrepresents the real performance characteristics: *"even with this very small data set, the view method is substantially faster"*. In reality, the negation is *O(n)* and the argsort is *O(n log n)*. This means the timing discrepancy will *diminish* for larger data sets - the *O(n log n)* term dominates, however your suggestion is an optimisation of the *O(n)* part. So the complexity remains the same, and it's for this small data set *in particular* that we see any significant differences. – wim Jul 15 '15 at 21:51
  • 2
    Asymptotically equivalent complexity can still mean that one algorithm is asymptotically twice as fast as another. Throwing away such distinctions can have consequences. For example, even if time discrepancy (as a percentage) does approach 0, I'd be willing to bet that the algorithm with negation still uses twice as much memory. – bug May 16 '19 at 21:06
  • 1
    @bug It can, but it doesn't in this case. I've added some timings to my answer. The numbers show that for larger arrays these approaches have similar timings, which supports the hypothesis that argsort is dominant. For negation, I would guess you're right about the memory usage, but users may still prefer that if they care about the position of nan's and/or need a stable sort. – wim Apr 27 '20 at 16:58
10

Instead of np.argsort you could use np.argpartition, if you only need the indices of the lowest/highest n elements.

That doesn't require sorting the whole array, just the part that you need. Note, however, that the order inside the partition is undefined, so while it gives the correct indices, they might not be correctly ordered:

>>> avgDists = [1, 8, 6, 9, 4]
>>> np.array(avgDists).argpartition(2)[:2]  # indices of lowest 2 items
array([0, 4], dtype=int64)

>>> np.array(avgDists).argpartition(-2)[-2:]  # indices of highest 2 items
array([1, 3], dtype=int64)
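If you do need those indices correctly ordered, a small follow-up sketch is to argsort just the n selected entries (cheap, since n is typically small):

>>> arr = np.array(avgDists)
>>> top = arr.argpartition(-2)[-2:]  # indices of highest 2 items, unordered
>>> top[np.argsort(arr[top])[::-1]]  # reorder them descending by value
array([3, 1], dtype=int64)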
MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • Or, if you are using the two together, that is argsort and argpartition, the operation has to be performed on the argpartition operation. – demongolem Apr 15 '19 at 18:47
  • Though this doesn't return indexes in descending order. The return should be [4, 0], [3,1] – Andrei R. Sep 29 '22 at 08:06
8

As @Kanmani hinted, an easier-to-interpret implementation may use numpy.flip, as in the following:

import numpy as np

avgDists = np.array([1, 8, 6, 9, 4])
ids = np.flip(np.argsort(avgDists))
print(ids)

By using the free function numpy.flip rather than chained member functions, the order of operations is easier to read.

Adam Erickson
  • 6,027
  • 2
  • 46
  • 33
7

You can use the flip commands numpy.flipud() (which reverses along the first axis, so it works on a 1D result) or numpy.fliplr() (which reverses along the second axis of a 2D array) to get the indexes in descending order after sorting with argsort. That's what I usually do.
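For a 1D argsort result, that might look like the following minimal sketch:

import numpy as np

avgDists = np.array([1, 8, 6, 9, 4])
ids = np.flipud(avgDists.argsort())  # reverse the ascending argsort
print(ids)  # [3 1 2 4 0] -- indices in descending order of value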

johnjps111
  • 1,160
  • 9
  • 24
Kanmani
  • 479
  • 7
  • 21
4

You could create a copy of the array and then multiply each element by -1.
As a result, the previously largest elements become the smallest.
The indices of the n smallest elements in the copy are the indices of the n greatest elements in the original.
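A minimal sketch of that idea:

import numpy as np

avgDists = np.array([1, 8, 6, 9, 4])
n = 3
negated = avgDists * -1      # a copy with every element negated
ids = negated.argsort()[:n]  # n smallest of the copy == n greatest of the original
print(ids)  # [3 1 2]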

MentholBonbon
  • 745
  • 4
  • 10
4

An elegant way could be as follows -

ids = np.flip(np.argsort(avgDists))

This will give you indices of elements sorted in descending order. Now you can use regular slicing...

top_n = ids[:n]
NiteshK
  • 101
  • 3
3

With your example:

avgDists = np.array([1, 8, 6, 9, 4])

Obtain indexes of n maximal values:

ids = np.argpartition(avgDists, -n)[-n:]

Sort them in descending order:

ids = ids[np.argsort(avgDists[ids])[::-1]]

Obtain results (for n=4):

>>> avgDists[ids]
array([9, 8, 6, 4])
Alexey Antonenko
  • 2,389
  • 1
  • 18
  • 18
1

Consider the order of equal elements

If you run a sorting routine and two elements are equal, their order is usually not changed. However, the flip/[::-1] approach reverses the order of equal elements.

>>> arr = np.array([3, 5, 4, 7, 3])
>>> 
>>> np.argsort(arr)[::-1]
array([3, 1, 2, 4, 0])  # equal elements reordered
>>> np.argsort(-arr)
array([3, 1, 2, 0, 4])  # equal elements not reordered (consistent with other sorts)

For consistency reasons I would hence prefer the argsort-of-the-negative-array approach. This is especially relevant when arr is a numeric representation of more complex elements.

Example:

obj = ['street', 'house', 'bridge', 'station', 'rails']
arr = np.array([3, 5, 4, 7, 3])  # cost of obj in coins

Disclaimer: A more common approach is to solve the example above with sorted(list_of_tuples_obj_cost, key=lambda x: x[1])
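That more common approach might look like the following sketch, built on the example above (sorting descending to match the question):

costs = list(zip(obj, arr.tolist()))  # [('street', 3), ('house', 5), ...]
by_cost_desc = sorted(costs, key=lambda x: x[1], reverse=True)
# [('station', 7), ('house', 5), ('bridge', 4), ('street', 3), ('rails', 3)]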

Markus Dutschke
  • 9,341
  • 4
  • 63
  • 58
-1

Another way is to use only a '-' in the argument for argsort, as in df[np.argsort(-df[:, 0])], provided df is a 2D NumPy array and you want to sort its rows by the first column (column index 0). Change the column index as appropriate. Of course, the column has to be a numeric one.
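A minimal sketch of that idea:

import numpy as np

df = np.array([[3, 10],
               [1, 20],
               [2, 30]])
print(df[np.argsort(-df[:, 0])])  # rows in descending order of column 0
# [[ 3 10]
#  [ 2 30]
#  [ 1 20]]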