I'm trying to solve the Top K Frequent Words Leetcode problem in O(N log K) time and am getting an undesirable result. My Python3 code and console output are below:
from collections import Counter
import heapq
class Solution:
def topKFrequent(self, words: List[str], k: int) -> List[str]:
counts = Counter(words)
print('Word counts:', counts)
result = []
for word in counts:
print('Word being added:', word)
if len(result) < k:
heapq.heappush(result, (-counts[word], word))
print(result)
else:
heapq.heappushpop(result, (-counts[word], word))
result = [r[1] for r in result]
return result
----------- Console output -----------
Word counts: Counter({'the': 3, 'is': 3, 'sunny': 2, 'day': 1})
Word being added: the
[(-3, 'the')]
Word being added: day
[(-3, 'the'), (-1, 'day')]
Word being added: is
[(-3, 'is'), (-1, 'day'), (-3, 'the')]
Word being added: sunny
[(-3, 'is'), (-2, 'sunny'), (-3, 'the'), (-1, 'day')]
When I run the test case ["the", "day", "is", "sunny", "the", "the", "sunny", "is", "is"]
with K = 4
, I find that the word the
gets shifted to the end of the list (after day
) once is
is added even though they both have a count of 3. This makes sense since the parent need only be <= the children and the children are not ordered in any way. Since (-2, 'sunny')
and (-3, 'the')
are both > (-3, 'is')
, the heap invariant is, in fact, maintained even though (-3, 'the')
< (-2, 'sunny')
and is the right child of (-3, 'is')
. The expected result is ["is","the","sunny","day"]
while the output of my code is ["is","sunny","the","day"]
.
Should I be using heaps to solve this problem in O(N log K) time, and if so, how can I modify my code to achieve the desired result?