A fast way to find the largest N elements in an numpy array

Question

I know I can do it like the following:

import numpy as np
N=10
a=np.arange(1,100,1)
np.argsort()[-N:]

However, it is very slow since it did a full sort.

I wonder whether numpy provide some methods the do it fast.

Possible duplicate of [How to get indices of N maximum values in a numpy array?](http://stackoverflow.com/questions/6910641/how-to-get-indices-of-n-maximum-values-in-a-numpy-array) — Seanny123, Feb 07 '16 at 14:55

score 88 · Answer 1 · answered Nov 24 '13 at 17:50

88

numpy 1.8 implements partition and argpartition that perform partial sort ( in O(n) time as opposed to full sort that is O(n) * log(n)).

import numpy as np

test = np.array([9,1,3,4,8,7,2,5,6,0])

temp = np.argpartition(-test, 4)
result_args = temp[:4]

temp = np.partition(-test, 4)
result = -temp[:4]

Result:

>>> result_args
array([0, 4, 8, 5]) # indices of highest vals
>>> result
array([9, 8, 6, 7]) # highest vals

Timing:

In [16]: a = np.arange(10000)

In [17]: np.random.shuffle(a)

In [18]: %timeit np.argsort(a)
1000 loops, best of 3: 1.02 ms per loop

In [19]: %timeit np.argpartition(a, 100)
10000 loops, best of 3: 139 us per loop

In [20]: %timeit np.argpartition(a, 1000)
10000 loops, best of 3: 141 us per loop

answered Nov 24 '13 at 17:50

Akavall

82,592
51
207
251

5

Note that may be helpful to others: The example is not the best choice, since the result is not guaranteed to be in order – c2huc2hu Mar 21 '17 at 19:36
4

@user3080953. I never say the result is guaranteed to be in order, that's what partial sort is. And in the example I provide: `[9, 8, 6, 7]` it is clear that n highest vals are not in order. – Akavall Mar 21 '17 at 22:47
1

yup, in hindsight, it's obvious because you can't sort in O(n). I spent 20 minutes looking for a bug, and thought this might be helpful for other people reading this – c2huc2hu Mar 21 '17 at 23:29
2

@user3080953. Try set "kth" as a sequence, as noted in the doc of numpy.argpartition -- "If provided with a sequence of k-th it will partition all of them into their sorted position at once." And, the example following the doc -- >>> x = np.array([3, 4, 2, 1]) >>> x[np.argpartition(x, 3)] array([2, 1, 3, 4]) >>> x[np.argpartition(x, (1, 3))] array([1, 2, 3, 4]) https://docs.scipy.org/doc/numpy/reference/generated/numpy.argpartition.html – caoanan May 22 '18 at 03:17
Can we also get an explanation why the array is inverted during the `argpartition`? Shouldn't it be inherently the same, but with the selection on `temp[:5]` instead of `temp[4:]` then? Or am I missing a crucial detail here? – dennlinger Sep 10 '18 at 10:45
4

@dennlinger By 'inverted', are you referring to `-test`? `np.partition` sort in increasing order by default. In order to sort in descending order, we can turn all the numbers into negative (`array([-9, -1, -3, -4, -8, -7, -2, -5, -6, 0])`) and sort in that array instead. – Lala La Mar 30 '19 at 18:33

score 48 · Accepted Answer · edited Jul 23 '17 at 03:47

48

The bottleneck module has a fast partial sort method that works directly with Numpy arrays: bottleneck.partition().

Note that bottleneck.partition() returns the actual values sorted, if you want the indexes of the sorted values (what numpy.argsort() returns) you should use bottleneck.argpartition().

I've benchmarked:

z = -bottleneck.partition(-a, 10)[:10]
z = a.argsort()[-10:]
z = heapq.nlargest(10, a)

where a is a random 1,000,000-element array.

The timings were as follows:

bottleneck.partition(): 25.6 ms per loop
np.argsort(): 198 ms per loop
heapq.nlargest(): 358 ms per loop

edited Jul 23 '17 at 03:47

John Zwinck

239,568
38
324
436

answered Apr 26 '12 at 16:35

NPE

486,780
108
951
1,012

2

@Mike Graham: Thanks for the edit, but `nanargmax()` does something rather different to what the OP is asking. I'm going to roll back the edit. Correct me if I'm missing something. – NPE Apr 26 '12 at 16:41
Probably bottleneck is faster, but since it is not provided in EPD7.1, we may not use that. – Hailiang Zhang Apr 26 '12 at 17:12
@HailiangZhang: I too would love to see `bottleneck` added to EPD. – NPE Apr 26 '12 at 18:46
3

For the record, `bottleneck.partsort()` and `np.argsort()` are doing two slightly different things. They return a value and an index respectively. If you want bottleneck to return the index, use `bottleneck.argpartsort` – Alex McLean Dec 22 '15 at 17:56
1

The timing on `heapq.nlargest` isn't quite fair. It would be better to run `heapq.nlargest(10, a.tolist())` – piRSquared Dec 25 '16 at 00:56

score 19 · Answer 3 · answered May 08 '17 at 14:28

I had this problem and, since this question is 5 years old, I had to redo all benchmarks and change the syntax of bottleneck (there is no partsort anymore, it's partition now).

I used the same arguments as kwgoodman, except the number of elements retrieved, which I increased to 50 (to better fit my particular situation).

I got these results:

bottleneck 1: 01.12 ms per loop
bottleneck 2: 00.95 ms per loop
pandas      : 01.65 ms per loop
heapq       : 08.61 ms per loop
numpy       : 12.37 ms per loop
numpy 2     : 00.95 ms per loop

So, bottleneck_2 and numpy_2 (adas's solution) were tied. But, using np.percentile (numpy_2) you have those topN elements already sorted, which is not the case for the other solutions. On the other hand, if you are also interested on the indexes of those elements, percentile is not useful.

I added pandas too, which uses bottleneck underneath, if available (http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies). If you already have a pandas Series or DataFrame to start with, you are in good hands, just use nlargest and you're done.

The code used for the benchmark is as follows (python 3, please):

import time
import numpy as np
import bottleneck as bn
import pandas as pd
import heapq

def bottleneck_1(a, n):
    return -bn.partition(-a, n)[:n]

def bottleneck_2(a, n):
    return bn.partition(a, a.size-n)[-n:]

def numpy(a, n):
    return a[a.argsort()[-n:]]

def numpy_2(a, n):
    M = a.shape[0]
    perc = (np.arange(M-n,M)+1.0)/M*100
    return np.percentile(a,perc)

def pandas(a, n):
    return pd.Series(a).nlargest(n)

def hpq(a, n):
    return heapq.nlargest(n, a)

def do_nothing(a, n):
    return a[:n]

def benchmark(func, size=1000000, ntimes=100, topn=50):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a, topn)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop

t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(pandas)
t4 = benchmark(hpq)
t5 = benchmark(numpy)
t6 = benchmark(numpy_2)
t0 = benchmark(do_nothing)

print("bottleneck 1: {:05.2f} ms per loop".format(t1 - t0))
print("bottleneck 2: {:05.2f} ms per loop".format(t2 - t0))
print("pandas      : {:05.2f} ms per loop".format(t3 - t0))
print("heapq       : {:05.2f} ms per loop".format(t4 - t0))
print("numpy       : {:05.2f} ms per loop".format(t5 - t0))
print("numpy 2     : {:05.2f} ms per loop".format(t6 - t0))

thanks for the code! I also tested np.argpartition and found that it's 10x slower than np.argmax, when argpartition is set to find the top 1 element. — Shaohua Li, Mar 27 '18 at 05:26

score 11 · Answer 4 · answered May 05 '12 at 15:02

Each negative sign in the proposed bottleneck solution

-bottleneck.partsort(-a, 10)[:10]

makes a copy of the data. We can remove the copies by doing

bottleneck.partsort(a, a.size-10)[-10:]

Also the proposed numpy solution

a.argsort()[-10:]

returns indices not values. The fix is to use the indices to find the values:

a[a.argsort()[-10:]]

The relative speed of the two bottleneck solutions depends on the ordering of the elements in the initial array because the two approaches partition the data at different points.

In other words, timing with any one particular random array can make either method look faster.

Averaging the timing across 100 random arrays, each with 1,000,000 elements, gives

-bn.partsort(-a, 10)[:10]: 1.76 ms per loop
bn.partsort(a, a.size-10)[-10:]: 0.92 ms per loop
a[a.argsort()[-10:]]: 15.34 ms per loop

where the timing code is as follows:

import time
import numpy as np
import bottleneck as bn

def bottleneck_1(a):
    return -bn.partsort(-a, 10)[:10]

def bottleneck_2(a):
    return bn.partsort(a, a.size-10)[-10:]

def numpy(a):
    return a[a.argsort()[-10:]]

def do_nothing(a):
    return a

def benchmark(func, size=1000000, ntimes=100):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop

t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(numpy)
t4 = benchmark(do_nothing)

print "-bn.partsort(-a, 10)[:10]: %0.2f ms per loop" % (t1 - t4)
print "bn.partsort(a, a.size-10)[-10:]: %0.2f ms per loop" % (t2 - t4)
print "a[a.argsort()[-10:]]: %0.2f ms per loop" % (t3 - t4)

Akavall · Answer 5 · 2013-07-01T18:33:31.857

8

Perhaps heapq.nlargest

import numpy as np
import heapq

x = np.array([1,-5,4,6,-3,3])

z = heapq.nlargest(3,x)

Result:

>>> z
[6, 4, 3]

If you want to find the indices of the n largest elements using bottleneck you could use bottleneck.argpartsort

>>> x = np.array([1,-5,4,6,-3,3])
>>> z = bottleneck.argpartsort(-x, 3)[:3]
>>> z
array([3, 2, 5]

edited Jul 01 '13 at 18:33

answered Apr 26 '12 at 16:34

Akavall

82,592
51
207
251

But heap q is actually slower (also mentioned by the next reply). – Hailiang Zhang Apr 26 '12 at 17:14

score 2 · Answer 6 · answered Dec 05 '15 at 22:44

You can also use numpy's percentile function. In my case it was slightly faster then bottleneck.partsort():

import timeit
import bottleneck as bn

N,M,K = 10,1000000,100

start = timeit.default_timer()
for k in range(K):
    a=np.random.uniform(size=M)
    tmp=-bn.partsort(-a, N)[:N]
stop = timeit.default_timer()
print (stop - start)/K

start = timeit.default_timer()
perc = (np.arange(M-N,M)+1.0)/M*100
for k in range(K):
    a=np.random.uniform(size=M)
    tmp=np.percentile(a,perc)
stop = timeit.default_timer()
print (stop - start)/K

Average time per loop:

bottleneck.partsort(): 59 ms
np.percentile(): 54 ms

Note that percentile might interpolate some values by default. If you want _exactly_ the same values as in the input array you can add the argument `interpolation='nearest'` to the call to `np.percentile`. See [documentation](https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html) for more details. — freidrichen, Aug 22 '17 at 07:15

score 1 · Answer 7 · answered Apr 26 '12 at 16:35

1

If storing the array as a list of numbers isn't problematic, you can use

import heapq
heapq.nlargest(N, a)

to get the N largest members.

answered Apr 26 '12 at 16:35

Mike Graham

73,987
14
101
130

A fast way to find the largest N elements in an numpy array

7 Answers7

Linked

Related