8

What is the fastest method to get the k smallest numbers in an unsorted list of size N using python?

Is it faster to sort the big list of numbers, and then get the k smallest numbers, or to get the k smallest numbers by finding the minimum in the list k times, making sure to remove the found minimum from the search before the next search?

trincot
  • 317,000
  • 35
  • 244
  • 286
jsky
  • 2,225
  • 5
  • 38
  • 54

6 Answers6

16

You could use a heap queue; it can give you the K largest or smallest numbers out of a list of size N in O(NlogK) time.

The Python standard library includes the heapq module, complete with a heapq.nsmallest() function ready implemented:

import heapq

k_smallest = heapq.nsmallest(k, input_list)

Internally, this creates a heap of size K with the first K elements of the input list, then iterating over the remaining N-K elements, pushing each to the heap, then popping off the largest one. Such a push and pop takes log K time, making the overall operation O(NlogK).

The function also optimises the following edge cases:

  • If K is 1, the min() function is used instead, giving you a O(N) result.
  • If K >= N, the function uses sorting instead, since O(NlogN) would beat O(NlogK) in that case.

A better option is to use the introselect algorithm, which offers an O(n) option. The only implementation I am aware of is using the numpy.partition() function:

import numpy

# assuming you have a python list, you need to convert to a numpy array first
array = numpy.array(input_list)
# partition, slice back to the k smallest elements, convert back to a Python list
k_smallest = numpy.partition(array, k)[:k].tolist()

Apart from requiring installation of numpy, this also takes N memory (versus K for heapq), as a copy of the list is created for the partition.

If you only wanted indices, you can use, for either variant:

heapq.nsmallest(k, range(len(input_list)), key=input_list.__getitem__)  # O(NlogK)
numpy.argpartition(numpy.array(input_list), k)[:k].tolist()  # O(N)
Community
  • 1
  • 1
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • 1
    If you scroll down, there's a quick discussion on doing it in O(n) time. – David Ehrmann Nov 10 '15 at 05:28
  • @DavidEhrmann: nope, because selecting the kth smallest item is glossed over a little too quickly there. That's what the heap queue does! – Martijn Pieters Nov 10 '15 at 05:32
  • Selection of the kth element in an unordered list can be done in n time and constant memory. The heap queue does neither. – David Ehrmann Nov 10 '15 at 05:43
  • because im actually interested in what the indexes are of the k smallest numbers in the initial list, (which i didnt specify), can i do this with the heapq? – jsky Nov 10 '15 at 06:14
  • 1
    @jsky: you could, using `heapq.nsmallest(k, range(len(input_list)), key=inputlist.__getitem__)`. – Martijn Pieters Nov 10 '15 at 06:19
  • 1
    @jsky: or use [`numpy.argpartition()`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.argpartition.html) for a O(n) introselect option: `numpy.argpartition(numpy.array(input_list), k)[:k].tolist()`. – Martijn Pieters Nov 10 '15 at 06:19
  • the argpartition and argsort methods only gave me elements, but the heapq.nsmallest gave me indices. thanks – jsky Nov 10 '15 at 07:13
6

If the list of the kth smallest numbers doesn't need to be sorted, this can be done in O(n) time with a selection algorithm like introselect. The standard library doesn't come with one, but NumPy has numpy.partition for the job:

partitioned = numpy.partition(l, k)
# The subarray partitioned[:k] now contains the k smallest elements.
user2357112
  • 260,549
  • 28
  • 431
  • 505
3

You might want to take a look at heapq:

In [109]: L = [random.randint(1,1000) for _ in range(100)]

In [110]: heapq.nsmallest(10, L)
Out[110]: [1, 17, 17, 19, 24, 37, 37, 45, 63, 73]
inspectorG4dget
  • 110,290
  • 27
  • 149
  • 241
3

EDIT: this assumes that the list is immutable. If the list is an array and can be modified there are linear methods available.

You can get the complexity down to O(n * log k) by using a heap of size k + 1.

  1. Initially get the first k elements into a min-heap.
  2. For every subsequent element, add the element as a leaf and heapify.
  3. Replace the last element with the next element.

Heapify can be done in logarithmic time and hence the time complexity is as above.

user1952500
  • 6,611
  • 3
  • 24
  • 37
2

You can do it in O(kn) with a selection algorithm. Once kn >= n log n, switch to sorting. That said, the constant on the selection algorithm tends to be a lot higher than the one on quicksort, so you really need to compare i (kn) and j (n log n). In practice, it's usually more desirable to just sort unless you're dealing with large n or very small k.

Edit: see comments. It's actually a lot better.

David Ehrmann
  • 7,366
  • 2
  • 31
  • 40
  • I don't think map-reduce will be any quicker than the max-heap solution. – OneCricketeer Nov 10 '15 at 05:12
  • Sorting takes O(NlogN), a heap queue gives you O(NlogK), and only takes N + K memory for the lists (one for the input list, one for the heap), plus the original N elements in the list (Python just copies across the references to the heap). – Martijn Pieters Nov 10 '15 at 05:15
  • 1
    The selection algorithm doesn't take `O(kn)` unless you're running `k` selections. It's much quicker to select the kth-smallest element and then return a list of everything smaller than that. – user2357112 Nov 10 '15 at 05:21
  • @user2357112: ah, right. Good call. In that case, you can do it in O(n)? – David Ehrmann Nov 10 '15 at 05:23
  • 1
    Yup, it just takes O(n). – user2357112 Nov 10 '15 at 05:23
  • @user2357112: not quite. The devil is in the details there; how do you select the kth smallest item? – Martijn Pieters Nov 10 '15 at 05:31
  • 2
    @MartijnPieters: With [introselect](https://en.wikipedia.org/wiki/Introselect), perhaps using the [NumPy implementation](http://docs.scipy.org/doc/numpy/reference/generated/numpy.partition.html). Introselect is O(n). – user2357112 Nov 10 '15 at 05:33
  • @user2357112: interesting, had not heard of introselect before. The numpy implementation looks like it would double the memory requirements though, as `partition` basically returns a copy of full array, partitioned, where you'd only need the first k elements. – Martijn Pieters Nov 10 '15 at 05:42
  • @user2357112: related, then, is [Worst-case O(n) algorithm for doing k-selection](http://stackoverflow.com/q/7358912) and [How to get indices of N maximum values in a numpy array?](http://stackoverflow.com/q/6910641) – Martijn Pieters Nov 10 '15 at 05:45
1

Using nsmallest numbers in heapq is less code but if you are looking to implement it yourself this is a simple way to do it. This solution requires looping through the data once only but since heappush and heappop run on O(log n) this algorithm would perform best on smaller numbers of k.

import heapq

def getsmallest(arr, k):
    m = [-x for x in l[:k]]
    heapq.heapify(m)
    for num in arr[5:]:
        print num, m
        heapq.heappush(m, max(-num, heapq.heappop(m)))
    return m

if __name__ == '__main__':
    l = [1,2,3,52,2,3,1]
    print getsmallest(l, 5)
pilatipus
  • 91
  • 1
  • 4