
I was watching this PyCon talk (at 34:30), and the speaker says that getting the t largest elements of a list of n elements can be done in O(t + n).

How is that possible? My understanding is that creating the heap will be O(n), but what's the complexity of `nlargest` itself? Is it O(n + t) or O(t)? And what's the actual algorithm?

– foo (edited by Viktoriya Malyasova)
  • You might be interested in [the source code](http://hg.python.org/cpython/file/3.4/Lib/heapq.py#l195). – lvc Apr 13 '14 at 03:36
  • If you want the result in sorted order, that's obviously not going to happen in linear time; otherwise you could call `nlargest` with `t=n` to comparison-sort a list in linear time, which is impossible. If you just want the `t` largest elements in *any* order, that can be done in O(n) with [quickselect](http://en.wikipedia.org/wiki/Quickselect). `heapq.nlargest` doesn't use quickselect, though; it gives the items in sorted order with a heap-based algorithm. – user2357112 Apr 13 '14 at 03:51
  • Just a general note: the claim that it takes O(t + n) time strikes me as odd, because that is just O(n). It's not technically incorrect, but it is somewhat strange to express it that way. – Niklas B. Apr 13 '14 at 04:17

4 Answers


The speaker is wrong in this case. The actual cost is O(n * log(t)). Heapify is called only on the first t elements of the iterable. That's O(t), but it's insignificant if t is much smaller than n. Then all the remaining elements are added to this "little heap" via `heappushpop`, one at a time. That takes O(log(t)) time per invocation of `heappushpop`, and the length of the heap remains t throughout. At the very end, the heap is sorted, which costs O(t * log(t)), but that's also insignificant if t is much smaller than n.
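Here is a minimal sketch of that keep-a-small-heap strategy (the real implementation is more careful about key functions and tie-breaking; `nlargest_sketch` is just an illustrative name):

```python
import heapq
from itertools import islice

def nlargest_sketch(t, iterable):
    """Sketch of the strategy described above: a min-heap of the t largest."""
    if t <= 0:
        return []
    it = iter(iterable)
    heap = list(islice(it, t))           # the first t elements
    heapq.heapify(heap)                  # O(t)
    for x in it:                         # the remaining n - t elements
        if x > heap[0]:                  # only displace the current minimum
            heapq.heappushpop(heap, x)   # O(log t) per call
    return sorted(heap, reverse=True)    # O(t log t) final sort

print(nlargest_sketch(3, [5, 1, 8, 3, 9, 2, 7]))  # [9, 8, 7]
```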

Fun with Theory ;-)

There are reasonably easy ways to find the t'th-largest element in expected O(n) time; for example, see here. There are harder ways to do it in worst-case O(n) time. Then, in another pass over the input, you could output the t elements >= the t-th largest (with tedious complications in case of duplicates). So the whole job can be done in O(n) time.
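For illustration, here is a sketch of that idea using quickselect-style partitioning (expected O(n) time, but O(n) extra memory; `t_largest_unordered` is my own name, and the result comes back in no particular order):

```python
import random

def t_largest_unordered(items, t):
    """Return the t largest elements in arbitrary order, expected O(n) time."""
    items = list(items)
    if t <= 0:
        return []
    if t >= len(items):
        return items
    pivot = random.choice(items)
    greater = [x for x in items if x > pivot]   # strictly above the pivot
    equal   = [x for x in items if x == pivot]  # the "tedious" duplicate case
    if t <= len(greater):
        return t_largest_unordered(greater, t)
    if t <= len(greater) + len(equal):
        return greater + equal[:t - len(greater)]
    lesser = [x for x in items if x < pivot]
    return greater + equal + t_largest_unordered(lesser, t - len(greater) - len(equal))

print(sorted(t_largest_unordered([5, 1, 8, 3, 9, 2, 7], 3)))  # [7, 8, 9]
```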

But those ways require O(n) memory too. Python doesn't use them. An advantage of what's actually implemented is that the worst-case "extra" memory burden is O(t), and that can be very significant when the input is, for example, a generator producing a great many values.

– Tim Peters (edited by wim)
  • Great, that makes sense; I was really hoping `O(t + n)` was right though, I thought I'd learn about some new heap wizardry :) – foo Apr 13 '14 at 03:48
  • See the edit just now for an O(n) method - but it has nothing to do with heaps, alas. – Tim Peters Apr 13 '14 at 03:56
  • Fun fact: you *can* in fact heapify the array in O(n) and fetch the top-k of the resulting heap in O(k) time per query. It's highly non-trivial, though, and the `heapq` module does not implement it. (It also probably has gigantic constant factors that make it infeasible in practice.) – Niklas B. Apr 13 '14 at 04:25
  • @NiklasB. where can I read about this `O(k)` algorithm? Even if it's non-trivial, I'm super interested! – foo Apr 13 '14 at 15:47
  • @foo http://stackoverflow.com/questions/22574580/algorithm-for-finding-the-largest-k-numbers-in-a-max-heap-of-size-n-in-ok-time – Niklas B. Apr 13 '14 at 15:51
  • @NiklasB. my guess is that will require `O(n + t)` space, vs. `heapq`, which uses just `O(t)` space when `t < n`. – user881300 Apr 03 '16 at 00:57
  • I think adding the remaining elements to the "little heap" costs O((n-t) log(t)), and the sorting at the very end takes O(t log(t)). Neither can be neglected in general; they sum to O(n log(t)). Am I right? – Yan Yang May 23 '19 at 23:17

For `heapq` t largest or t smallest, the time complexity is O(n log(t)).

heapq builds a heap from the first t elements, then iterates over the remaining elements, pushing to and popping from the heap so that it always holds the t most extreme elements seen so far (see the usage example below).

  1. Building the heap from the first t elements takes O(t) via heapify (or O(t log(t)) if built by repeated pushes).
  2. Pushing and popping the remaining n - t elements takes O((n - t) log(t)).
  3. The overall time complexity is therefore O(n log(t)).
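For reference, a quick usage example of the library calls being analyzed:

```python
import heapq

data = [5, 1, 8, 3, 9, 2, 7]
print(heapq.nlargest(3, data))   # [9, 8, 7]: maintains a size-3 heap while scanning
print(heapq.nsmallest(3, data))  # [1, 2, 3]
```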
– Manu Manoj

If you instead heapify the entire input, the cost is O(n + t*log(n)): heapify takes O(n), and extracting each largest or smallest element takes O(log(n)), so extracting t of them takes O(t*log(n)). The overall time complexity is therefore O(n + t*log(n)). (Note that this is a different approach from what `heapq.nlargest` actually implements; see Tim Peters' answer above.)
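A sketch of that heapify-everything approach (not what `heapq.nlargest` does; `t_largest_via_full_heap` is an illustrative name, and it needs O(n) extra memory):

```python
import heapq

def t_largest_via_full_heap(items, t):
    heap = [-x for x in items]   # negate values: heapq only provides a min-heap
    heapq.heapify(heap)          # O(n) over all n elements
    # Each pop of the current maximum costs O(log n); t pops cost O(t log n).
    return [-heapq.heappop(heap) for _ in range(min(t, len(heap)))]

print(t_largest_via_full_heap([5, 1, 8, 3, 9, 2, 7], 3))  # [9, 8, 7]
```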


According to the comments in the heapq source code:

```python
# Algorithm notes for nlargest() and nsmallest()
# ==============================================
#
# Make a single pass over the data while keeping the k most extreme values
# in a heap.  Memory consumption is limited to keeping k values in a list.
#
# Theoretical number of comparisons for k smallest of n random inputs:
#
# Step   Comparisons                  Action
# ----   --------------------------   ---------------------------
#  1     1.66 * k                     heapify the first k-inputs
#  2     n - k                        compare remaining elements to top of heap
#  3     k * (1 + lg2(k)) * ln(n/k)   replace the topmost value on the heap
#  4     k * lg2(k) - (k/2)           final sort of the k most extreme values
#
# Combining and simplifying for a rough estimate gives:
#
#        comparisons = n + k * (log(k, 2) * log(n/k) + log(k, 2) + log(n/k))
```
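One way to sanity-check these estimates empirically is to count comparisons with a small wrapper class (a sketch; `Counted` is my own name, and it only counts `<` calls, which dominate here):

```python
import heapq
import math
import random

class Counted:
    """Wraps a value and counts how many times `<` is invoked on instances."""
    comparisons = 0
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

n, k = 100_000, 100
data = [Counted(random.random()) for _ in range(n)]
heapq.nsmallest(k, data)
estimate = n + k * (math.log2(k) * math.log(n / k) + math.log2(k) + math.log(n / k))
print(Counted.comparisons, round(estimate))  # the two should be in the same ballpark
```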
– tozCSS