
I was watching this PyCon talk (at 34:30), and the speaker says that getting the t largest elements of a list of n elements can be done in O(t + n).

How is that possible? My understanding is that creating the heap will be O(n), but what's the complexity of `nlargest` itself? Is it O(n + t) or O(t)? And what's the actual algorithm?

– foo (edited by Viktoriya Malyasova)
  • You might be interested in [the source code](http://hg.python.org/cpython/file/3.4/Lib/heapq.py#l195). – lvc Apr 13 '14 at 03:36
  • If you want the result in sorted order, that's obviously not going to happen in linear time; otherwise you could call `nlargest` with `t=n` to comparison-sort a list in linear time, which is impossible. If you just want the `t` largest elements in *any* order, that can be done in O(n) with [quickselect](http://en.wikipedia.org/wiki/Quickselect). `heapq.nlargest` doesn't use quickselect, though; it gives the items in sorted order with a heap-based algorithm. – user2357112 Apr 13 '14 at 03:51
  • Just a general note: the claim that it takes O(t + n) time strikes me as odd, because that is just O(n). It's not technically incorrect, but it is somewhat strange to express it that way. – Niklas B. Apr 13 '14 at 04:17

4 Answers


The speaker is wrong in this case. The actual cost is O(n * log(t)). Heapify is called only on the first t elements of the iterable. That's O(t), but it's insignificant if t is much smaller than n. Then all the remaining elements are added to this "little heap" via `heappushpop`, one at a time. That takes O(log(t)) time per invocation of `heappushpop`, and the length of the heap remains t throughout. At the very end, the heap is sorted, which costs O(t * log(t)), but that's also insignificant if t is much smaller than n.
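Here is a minimal sketch of that keep-a-small-heap strategy (the real implementation is more careful about key functions and tie-breaking; `nlargest_sketch` is just an illustrative name):

```python
import heapq
from itertools import islice

def nlargest_sketch(t, iterable):
    """Sketch of the strategy described above: a min-heap of the t largest."""
    if t <= 0:
        return []
    it = iter(iterable)
    heap = list(islice(it, t))           # the first t elements
    heapq.heapify(heap)                  # O(t)
    for x in it:                         # the remaining n - t elements
        if x > heap[0]:                  # only displace the current minimum
            heapq.heappushpop(heap, x)   # O(log t) per call
    return sorted(heap, reverse=True)    # O(t log t) final sort

print(nlargest_sketch(3, [5, 1, 8, 3, 9, 2, 7]))  # [9, 8, 7]
```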

Fun with Theory ;-)

There are reasonably easy ways to find the t'th-largest element in expected O(n) time; for example, see here. There are harder ways to do it in worst-case O(n) time. Then, in another pass over the input, you could output the t elements >= the t-th largest (with tedious complications in case of duplicates). So the whole job can be done in O(n) time.
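For illustration, here is a sketch of that idea using quickselect-style partitioning (expected O(n) time, but O(n) extra memory; `t_largest_unordered` is my own name, and the result comes back in no particular order):

```python
import random

def t_largest_unordered(items, t):
    """Return the t largest elements in arbitrary order, expected O(n) time."""
    items = list(items)
    if t <= 0:
        return []
    if t >= len(items):
        return items
    pivot = random.choice(items)
    greater = [x for x in items if x > pivot]   # strictly above the pivot
    equal   = [x for x in items if x == pivot]  # the "tedious" duplicate case
    if t <= len(greater):
        return t_largest_unordered(greater, t)
    if t <= len(greater) + len(equal):
        return greater + equal[:t - len(greater)]
    lesser = [x for x in items if x < pivot]
    return greater + equal + t_largest_unordered(lesser, t - len(greater) - len(equal))

print(sorted(t_largest_unordered([5, 1, 8, 3, 9, 2, 7], 3)))  # [7, 8, 9]
```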

But those ways require O(n) memory too. Python doesn't use them. An advantage of what's actually implemented is that the worst-case "extra" memory burden is O(t), and that can be very significant when the input is, for example, a generator producing a great many values.

– Tim Peters (edited by wim)
  • Great, that makes sense; I was really hoping `O(t + n)` was right though, I thought I'd learn about some new heap wizardry :) – foo Apr 13 '14 at 03:48
  • See the edit just now for an O(n) method - but it has nothing to do with heaps, alas. – Tim Peters Apr 13 '14 at 03:56
  • Fun fact: you *can* in fact heapify the array in O(n) and fetch the top-k of the resulting heap in O(k) time per query. It's highly non-trivial, though, and the `heapq` module does not implement it. (It also probably has gigantic constant factors that make it infeasible in practice.) – Niklas B. Apr 13 '14 at 04:25
  • @NiklasB. where can I read about this `O(k)` algorithm? Even if it's non-trivial, I'm super interested! – foo Apr 13 '14 at 15:47
  • @foo http://stackoverflow.com/questions/22574580/algorithm-for-finding-the-largest-k-numbers-in-a-max-heap-of-size-n-in-ok-time – Niklas B. Apr 13 '14 at 15:51
  • @NiklasB. my guess is that will require `O(n + t)` space, vs. `heapq`, which uses just `O(t)` space when `t < n`. – user881300 Apr 03 '16 at 00:57
  • I think adding the remaining elements to the "little heap" costs O((n-t) log(t)), and the sorting at the very end takes O(t log(t)). Neither can be neglected in general; they sum to O(n log(t)). Am I right? – Yan Yang May 23 '19 at 23:17

For `heapq` t largest or t smallest, the time complexity is O(n log(t)).

heapq builds a heap from the first t elements, then iterates over the remaining elements, pushing to and popping from the heap so that it always holds the t most extreme elements seen so far (see the usage example below).

  1. Building the heap from the first t elements takes O(t) via heapify (or O(t log(t)) if built by repeated pushes).
  2. Pushing and popping the remaining n - t elements takes O((n - t) log(t)).
  3. The overall time complexity is therefore O(n log(t)).
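For reference, a quick usage example of the library calls being analyzed:

```python
import heapq

data = [5, 1, 8, 3, 9, 2, 7]
print(heapq.nlargest(3, data))   # [9, 8, 7]: maintains a size-3 heap while scanning
print(heapq.nsmallest(3, data))  # [1, 2, 3]
```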
– Manu Manoj

If you instead heapify the entire input, the cost is O(n + t*log(n)): heapify takes O(n), and extracting each largest or smallest element takes O(log(n)), so extracting t of them takes O(t*log(n)). The overall time complexity is therefore O(n + t*log(n)). (Note that this is a different approach from what `heapq.nlargest` actually implements; see Tim Peters' answer above.)
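A sketch of that heapify-everything approach (not what `heapq.nlargest` does; `t_largest_via_full_heap` is an illustrative name, and it needs O(n) extra memory):

```python
import heapq

def t_largest_via_full_heap(items, t):
    heap = [-x for x in items]   # negate values: heapq only provides a min-heap
    heapq.heapify(heap)          # O(n) over all n elements
    # Each pop of the current maximum costs O(log n); t pops cost O(t log n).
    return [-heapq.heappop(heap) for _ in range(min(t, len(heap)))]

print(t_largest_via_full_heap([5, 1, 8, 3, 9, 2, 7], 3))  # [9, 8, 7]
```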


According to the comments in the heapq source code:

```python
# Algorithm notes for nlargest() and nsmallest()
# ==============================================
#
# Make a single pass over the data while keeping the k most extreme values
# in a heap.  Memory consumption is limited to keeping k values in a list.
#
# Theoretical number of comparisons for k smallest of n random inputs:
#
# Step   Comparisons                  Action
# ----   --------------------------   ---------------------------
#  1     1.66 * k                     heapify the first k-inputs
#  2     n - k                        compare remaining elements to top of heap
#  3     k * (1 + lg2(k)) * ln(n/k)   replace the topmost value on the heap
#  4     k * lg2(k) - (k/2)           final sort of the k most extreme values
#
# Combining and simplifying for a rough estimate gives:
#
#        comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k))
```
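One way to sanity-check these estimates empirically is to count comparisons with a small wrapper class (a sketch; `Counted` is my own name, and it only counts `<` calls, which dominate here):

```python
import heapq
import math
import random

class Counted:
    """Wraps a value and counts how many times `<` is invoked on instances."""
    comparisons = 0
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

n, k = 100_000, 100
data = [Counted(random.random()) for _ in range(n)]
heapq.nsmallest(k, data)
estimate = n + k * (math.log2(k) * math.log(n / k) + math.log2(k) + math.log(n / k))
print(Counted.comparisons, round(estimate))  # the two should be in the same ballpark
```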
– tozCSS