1

I was checking out this solution to a question on leetcode.com

def topKFrequent(self, words, k):
        count = collections.Counter(words)
        heap = [(-freq, word) for word, freq in count.items()]
        heapq.heapify(heap)
        return [heapq.heappop(heap)[1] for _ in xrange(k)]

and when I provide it an array of strings like ["aa", "aaa", "a"] and 1 it correctly returns ["a"]. My question is did the heap also lexographically sort the tuples internally? Because according to me if there was no sorting, it would have simply returned ["aa"] (the order in which the heap was constructed since counts of all three are the same). Or have I misunderstood heapq?

Saturnian
  • 1,686
  • 6
  • 39
  • 65

4 Answers4

3

You have a heap of integer/string pairs, and so it is ordered based on the definition of < for tuples, which takes into account both elements of each type.

Given ["aa", "aaa", "a"], count.items() is sequence of tuples [('aa', 1), ('aaa', 1), ('a', 1)]. You then build a heap using the list of tuples

[(-1, 'aa'), (-1, 'aaa'), (-1, 'a')]

Since the first element of each tuple is the same, the comparisons are determined solely by the second, string, element.

chepner
  • 497,756
  • 71
  • 530
  • 681
2

heapq just compares values from the queue using using the "less than" operator [1] regardless of the type of the value. It is the type of the value that defines what the comparison will return. So, what makes the difference here is the tuple itself. As from the documentation:

The comparison [of sequence objects] uses lexicographical ordering: first the first two items are compared, and if they differ this determines the outcome of the comparison; if they are equal, the next two items are compared, and so on, until either sequence is exhausted.

Checking some examples:

>>> (0, 'a') < (1, 'aa')
True
>>> (1, 'a') < (1, 'aa')
True
>>> (1, 'aa') < (1, 'a')
False
>>> (2, 'a') < (1, 'aa')
False

So you are right, the values are ordered lexicographically and the second value of the tuple is relevant. However, heapq does not have to do anything here to get this result, the mere tuple comparison does that.

[1] One can check it in the code. Here is one of the lines where the comparison is made by heapq (in C):

cmp = PyObject_RichCompareBool(newitem, parent, Py_LT);

This PyObject_RichCompareBool() is, according to the documentation:

the equivalent of the Python expression o1 op o2, where op is the operator corresponding to opid.

brandizzi
  • 26,083
  • 8
  • 103
  • 158
  • More on tuple comparisons [here](https://stackoverflow.com/questions/5292303/how-does-tuple-comparison-work-in-python). – brandizzi Feb 19 '20 at 18:13
  • Great answer. It's a pity that the API docs themselves don't shed much light on how tuples in a heapq are actually compared. – Holger Brandl Nov 22 '20 at 09:51
0

Heaps are partial orderings. They are not sorted. You can, however, build sorts out of them by storing values in a heap and pulling them out one at a time. These sorts are not stable, because heaps do not try to preserve the ordering of "equal" values.

Here's another kind of Python heap you might be interested in: https://pypi.org/project/fibonacci-heap-mod/

dstromberg
  • 6,954
  • 1
  • 26
  • 27
  • Though it's worth pointing out that heaps obey the [heap property](https://xlinux.nist.gov/dads/HTML/heapproperty.html), which a sorted list satisfies. A heap doesn't *have* to be sorted, but it *can* be. – chepner Feb 19 '20 at 17:20
0

The expectation of the leetcode question is to solve the problem in O(nlogk). So we have to keep only 'k' elements in the heap at any time, which means we have to use "minHeap" (freq, word) and not (-freq,word).

We want 'minHeap' to keep the 'minimum frequency' and 'max lexicographical' value at the top of the heap. That is tricky, because by default it would keep 'minimum frequency' and 'min lex'.

The only solution is to create an object that can have 'freq' and 'word' and override the 'lt' method to do this

def __lt__(self, other):
    if self.c == other.c:
        return self.w > other.w
    return self.c < other.c