22

I'm sorry in advance if this is a duplicated question, I looked for this information but still couldn't find it.

Is it possible to arrange a numpy array (or python list) by using the indexes of the N biggest elements in decreasing order very efficiently?

For instance, the array:

a = array([4, 1, 0, 8, 5, 2])

The indexes of the biggest elements in decreasing order would give (considering N = 6, all the elements are included):

8 --> 3

5 --> 4

4 --> 0

2 --> 5

1 --> 1

0 --> 2

result = [3, 4, 0, 5, 1, 2]

I know how to make it using a somewhat silly approach (like sorting the array and searching for each of the N numbers for their indexes), but I was wondering if is there any efficient library like bottleneck or heapq or maybe a pythonic approach to make this very fast. I have to apply it in several arrays with 300k elements each so that's why performance is an issue.

Thanks in advance!

UPDATE

I read the answers and decided to timeit them using a 300k of random integers, here are the results:

solution 1: sorted(range(len(a)), key=lambda i:a[i]) time: 230 ms

solution 2: heapq.nlargest(len(a), zip(a, itertools.count())) time: 396 ms

solution 3: heapq.nlargest(len(a), enumerate(a), key=operator.itemgetter(1)) time: 864 ms

solution 4: def f(a,N): return np.argsort(a)[::-1][:N] (N = len(a)) time: 104 ms

Thanks a lot for the fast and very good answers!

inspectorG4dget
  • 110,290
  • 27
  • 149
  • 241
Willian Fuks
  • 11,259
  • 10
  • 50
  • 74
  • possible duplicate of [How to get the N maximum values in a numpy array?](http://stackoverflow.com/questions/6910641/how-to-get-the-n-maximum-values-in-a-numpy-array) – senderle Oct 08 '12 at 20:07
  • 1
    If you follow the trail of duplicates, [this answer](http://stackoverflow.com/a/10489712/577088) pops up, which seems promising -- although the post is by the developer, a fact that the answer does not disclose... – senderle Oct 08 '12 at 20:09
  • In your test, what is the value of N ? As explained above, using heapq is efficient is N is rather small compare to len(a). – lizzie Mar 13 '13 at 15:08
  • How do you modify these for `N < len(a)`? – Superbest Apr 27 '15 at 11:04
  • I agree with @lizzie. Can you provide the value of `N` and `len(a)` in your experiment? I think `heapq.nlargest` should be more efficient than `np.argsort` if `N` is much smaller than `len(a)`. – stanleyli Jul 20 '16 at 21:37
  • seem the https://gist.github.com/tinix84/27a001611f1e8b8aadc33ee797ef7193 – tinix84 Feb 07 '20 at 12:46

4 Answers4

22

Have you looked at the built-in numpy argsort method?:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

I can sort an array with 300,000 random floats in about 29 ms on my machine using that method.

def f(a,N):
    return np.argsort(a)[::-1][:N]
halex
  • 16,253
  • 5
  • 58
  • 67
JoshAdel
  • 66,734
  • 27
  • 141
  • 140
  • This worked amazingly well! In my machine it's taking 104 ms (it's very busy right now), later on I'll try it again, but so far this has been the fastest solution. Tnx! – Willian Fuks Oct 08 '12 at 20:26
  • 1
    @joshadel This function argsort's first and then returns the top-N values. Is there a Numpy/Scipy function that is equivalent to the Python heapq.nlargest(N, a) but for finding the top-N index values without argsort'ing the entire array? – Henry Thornton Oct 11 '12 at 17:42
  • @dbv maybe something like http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.argpartsort, but that seems to only work for the smallest values. – JoshAdel Oct 12 '12 at 02:36
11
L = [4, 1, 0, 8, 5, 2]
sorted(range(len(L)), key=lambda i:L[i])
inspectorG4dget
  • 110,290
  • 27
  • 149
  • 241
6

You can use heapq to do this easily enough:

>>> heapq.nlargest(3, zip(a, itertools.count()))
[(8, 3), (5, 4), (4, 5)]

Tuples are sorted by sorting on the first value, then the second, etc... This means that we can simply make a tuple of (value, index) and sort, giving us the indices of the values (the values are also given, but we can easily throw these away).

I am using zip() and itertools.count() as enumerate gives us the wrong order, so they will be sorted by index, rather than by value. Alternatively, you could also do ((value, index) for index, value in enumerate(a)), but I feel that is less clear.

Another alternative is to give a key, doing heapq.nlargest(3, enumerate(a), key=operator.itemgetter(1)).

Gareth Latty
  • 86,389
  • 17
  • 178
  • 183
  • The python documentation recommends using sorted() instead of heapq.nlargest() when working with large lists, although they don't clarify how big a "large list" is. http://docs.python.org/library/heapq.html – Matt Oct 08 '12 at 19:21
  • @Matt I can't find the suggestion saying that in the docs - but that may well be the case - I'd suggest the OP runs some ``timeit`` tests to find out what's the most efficient for his use. – Gareth Latty Oct 08 '12 at 19:25
  • @Lattyware At docs.python.org/library/heapq.html it says "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function. Also, when n==1, it is more efficient to use the built-in min() and max() functions." – Matt Oct 08 '12 at 23:42
1

Another way to use heapq

heapq.nlargest(n, range(len(a)), key=a.__getitem__)

As commented elsewhere, it won't beat sorting unless a is very large and n<<len(a) because sorting is a relatively fast operation in Python. However eventually a slow O(n) will always beat the O(n*log(n))

John La Rooy
  • 295,403
  • 53
  • 369
  • 502
  • 1
    Yes, you are correct that slow `O(n)` beats `O(n*log(n))` for large `n`, but `heapq` module has been implemented intelligently, taking care that the `n` value passed to `nlargest()` function will use heap only if n is relatively small and switch to sorting when `n` is significantly larger and tending to sizeof list. – Piyush Deshmukh Dec 13 '15 at 09:58
  • nlargest from at least v2.7.11 will use a heap regardless, but v3.5.2 behaves as you describe @sinister – Joffer Aug 20 '16 at 13:54
  • actually, in v3.5.2, sorted is only used when n is larger than the size of the iterable @sinister – Joffer Aug 21 '16 at 12:57