6

I'm trying to determine the fastest runtime for getting k (key,value) pairs based on the smallest k keys in a dictionary. i.e.: for

mynahs = {40:(1,3),5:(5,6),11:(9,2),2:(6,3),300:(4,4),15:(2,8)}

smallestK(mynahs,3)

would return:

[(2,(6,3)),(5,(5,6)),(11,(9,2))]

I've seen a few different ways to do this:
1.

mylist = list(mynahs.keys())
mylist.sort
mylist = mylist[:k]
return [(k, mynahs[k]) for k in mylist]

but everyone seems to think heapq is the fastest

cheap = heapq.nsmallest(3, mynahs)
return [(k, mynahs[k]) for k in cheap]

How does heapq.nsmallest work and why is it fastest? I have seen this question and this one I still don't understand. Is heapq using a minheap to get the nsmallest? How does that work? I've also heard about an algorithm called quickselect, is that what it's using?

What's the runtime of it? If the dictionary is constantly changing/updating, is calling heapq.nsmallest each time you need the nsmallest the fastest way to do that?

singmotor
  • 3,930
  • 12
  • 45
  • 79

2 Answers2

7

The code for heapq.py is available at https://svn.python.org/projects/python/trunk/Lib/heapq.py

nsmallest uses one of two algorithms. If the number of items to be returned is more than 10% of the total number of items in the heap, then it makes a copy of the list, sorts it, and returns the first k items.

If k is smaller than n/10, then it uses a heap selection algorithm:

Make a copy of the first k items, and sort it
for each remaining item in the original heap
    if the item is smaller than the largest item in the new list
        replace the largest item with the new item
        re-sort the new list

That whoever wrote this used such an inefficient algorithm is somewhat surprising. In theory at least, Quick select, which is an O(n) algorithm, should be faster than sorting, and much faster than the "optimized" algorithm for selecting n/10 items.

I'm not a Python guy, so I can't say for sure, but my experience with other languages indicates that the above should be true for Python as well.

Update

The implementation at https://github.com/python/cpython/blob/master/Lib/heapq.py#L395 works somewhat differently.

If the k is greater than or equal to the number of items in the list, then a sorted list containing all of the elements is returned. Otherwise, it uses a standard heap selection algorithm:

create a max heap from the first k items
for each remaining item
    if the item is smaller than the largest item on the heap
        remove the largest item from the heap
        add the new item to the heap
sort the resulting heap and return

The remove/add is combined into a single function called heap_replace.

There's an optimization in there to use the standard comparator if the key is None, but it uses the same basic heap selection algorithm.

This implementation is much more efficient than the other one that I described, although I expect it to be slower than Quickselect in the general case.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351
0

heapq uses a a heap ( _heapify_max )

Here is the implementation for heapq.nsmallest - https://github.com/python/cpython/blob/master/Lib/heapq.py#L395

Also look at:

tuxdna
  • 8,257
  • 4
  • 43
  • 61