
Given a stream of words of unknown size, I need to find the k most frequent words at any given time, and the program should be able to run for a long time.

I've seen posts similar to the solution I'm thinking of:
The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence
How to find the most frequent element in a stream of numbers with additions and deletions allowed
but my problem is how to handle running out of memory; the posts don't mention this issue.

So I'm planning to store a hash map with each word as the key and its frequency as the value, plus a min-heap of size k that holds the most frequent words.
For each incoming word I'll increment its count in the hash map and check whether it should enter the heap.
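A minimal sketch of this hash-plus-heap idea (class and method names are mine, not from any of the linked posts). Here the size-k min-heap is built at query time via `heapq.nlargest`, which internally maintains exactly such a heap over the counts:

```python
import heapq
from collections import Counter

class TopKWords:
    """Sketch: exact counts in a dict, size-k min-heap at query time."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def add(self, word):
        # Memory grows with the number of distinct words -- this is
        # exactly the unbounded-growth problem described above.
        self.counts[word] += 1

    def top_k(self):
        # nlargest keeps a min-heap of size k while scanning all
        # (word, count) pairs, the structure described in the question.
        return heapq.nlargest(self.k, self.counts.items(),
                              key=lambda kv: kv[1])
```

Usage:

```python
t = TopKWords(2)
for w in "a b a c a b".split():
    t.add(w)
t.top_k()   # → [('a', 3), ('b', 2)]
```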

The issue is that the program should run for a long time, so I'll keep encountering new words with small frequencies,
and I need to keep them in the hash map in case their frequency increases later in the stream.
This creates a memory problem, because over time the hash map can get really big.

I wonder if the only solution is to evict certain words
(but that would make the counts inaccurate).

I'm not sure how to handle this. Any suggestions?

Bentz
  • What are the actual constraints? No limit to number of words, no limit to word length and no limit to stream length? Then I'm pretty sure you'd need unlimited memory to solve this. – domen May 13 '19 at 08:49
  • A quick test case to show this: stream starts with aaaaaaaaaa, aaaaaaaaab, etc.... and after 10 GB outputs one of the previous words. – domen May 13 '19 at 08:56

1 Answer


You can use a trie, storing at each word-end node the number of occurrences seen so far.

The following trie corresponds to this input:

ABL ABO AC AC AC ACR ACR

(figure: trie built from the input above, with occurrence counts at the word-end nodes)

Then, on finishing the current word, you increment the corresponding counter and check whether it is bigger than the smallest count in the k-most-frequent list; if so, replace that entry.

So, you need O(1) time to handle each incoming character (for a fixed alphabet), plus an O(log k) heap update at each word boundary.
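A minimal dict-based trie with per-word counters, along the lines described above (class and method names are illustrative, not from the answer):

```python
class TrieCounter:
    """Trie node: one child per character, plus a count at word-end nodes."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, word):
        node = self
        for ch in word:  # constant work per incoming character
            node = node.children.setdefault(ch, TrieCounter())
        node.count += 1
        return node.count  # updated frequency, to compare against the heap minimum
```

With the answer's input:

```python
trie = TrieCounter()
for w in "ABL ABO AC AC AC ACR ACR".split():
    freq = trie.add(w)
# the node for "AC" now holds count 3, "ACR" holds 2, "ABL" and "ABO" hold 1
```

The value returned by `add` is what you would compare against the smallest entry in the k-most-frequent list.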

Yola
  • Definitely the right approach. Another option using the trie would be to first build it, then BFS/DFS it in order to find the `k` highest values in its nodes instead of maintaining a list of the `k` best. Chances are that it would be cheaper overall. – m.raynal May 13 '19 at 10:41
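The comment's alternative, traversing the trie once to find the `k` highest counts instead of maintaining a running list, could be sketched like this (assuming trie nodes with a `children` dict and a `count` field; the node structure and names are illustrative):

```python
import heapq

class Node:
    """Minimal trie node (illustrative): children per char, count at word end."""

    def __init__(self):
        self.children = {}
        self.count = 0

def top_k_by_dfs(root, k):
    # Walk the whole trie once, collecting (word, count) at word-end
    # nodes, then keep the k largest counts -- computing top-k on
    # demand rather than maintaining it online.
    found, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            found.append((prefix, node.count))
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return heapq.nlargest(k, found, key=lambda wc: wc[1])
```

This trades a full traversal at query time for no per-word heap maintenance, which is cheaper when queries are rare relative to insertions.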