If the words are not weighted (other than weights 0 and 1), then it is possible to derive a simple data structure which maintains the word counts in order, using O(N) auxiliary storage where N is the number of unique words encountered in the sliding window (one hour, in the example). All operations (add a word, expire a word, look up the most frequent word) can be performed in O(1) time. Since any accurate solution needs to retain all the unique words in the sliding window, this solution is not asymptotically worse, although the constant factor per word is not small.
The key to the solution is that the count for any given word can only be incremented or decremented by 1, and that all of the counts are integers. Consequently, it is possible to maintain a doubly-linked list of counts (in order) where each node in the list points to a doubly-linked list of words which have that count. In addition, each node in the word-list points back to the appropriate count node. Finally, we maintain a hashmap which allows us to find the node corresponding to a given word.
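For concreteness, here is a minimal Python sketch of that structure. The doubly-linked word-lists are modelled as Python sets (which also give O(1) insert and remove), the names (`CountNode`, `TopWords`, `node_for_word`) are purely illustrative, and the splice/unlink helpers are included because the operations described below need them:

```python
class CountNode:
    """One node of the doubly-linked count list (sketch only)."""
    __slots__ = ("count", "words", "prev", "next")

    def __init__(self, count):
        self.count = count
        self.words = set()   # words whose current count is exactly `count`
        self.prev = None     # neighbour holding the next smaller count
        self.next = None     # neighbour holding the next larger count


class TopWords:
    """The count list plus the hashmap from each word to its count node."""

    def __init__(self):
        self.node_for_word = {}  # word -> CountNode currently holding it
        self.head = None         # count node with the smallest count in use
        self.tail = None         # count node with the largest count in use

    def _link_after(self, prev_node, new_node):
        """Splice new_node into the count list just after prev_node
        (prev_node=None means 'insert at the head')."""
        new_node.prev = prev_node
        new_node.next = self.head if prev_node is None else prev_node.next
        if new_node.prev is None:
            self.head = new_node
        else:
            new_node.prev.next = new_node
        if new_node.next is None:
            self.tail = new_node
        else:
            new_node.next.prev = new_node
        return new_node

    def _unlink(self, node):
        """Remove an (empty) count node from the count list."""
        if node.prev is None:
            self.head = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
```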
In order to decay the words at the end of their life, we also need to retain the entire data stream from the sliding window, which has a size of O(N') where N' is the total number of words encountered during the sliding window. This can be stored as a singly-linked list where each node has a timestamp and a pointer to the unique word in the word-list.
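Here is a sketch of that stream, with a `collections.deque` standing in for the singly-linked list (and storing the word itself rather than a pointer into the word-list); the names `record` and `expired` are assumptions, not part of the structure described above:

```python
import time
from collections import deque

# One entry per word occurrence in the window, oldest first.
occurrences = deque()   # holds (timestamp, word) pairs

def record(word, now=None):
    """Append one occurrence of `word` to the stream."""
    occurrences.append((time.time() if now is None else now, word))

def expired(cutoff):
    """Pop and yield every word whose occurrence has aged out of the window;
    the caller then decrements that word's count (see the sketch further down)."""
    while occurrences and occurrences[0][0] < cutoff:
        _, word = occurrences.popleft()
        yield word
```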
When a word is encountered or expired, its count needs to be adjusted. Since the count can only be incremented or decremented by 1, the adjustment always consists of moving the word to the adjacent count-node (which may or may not exist); since the count-nodes are stored in a sorted linked list, the adjacent node can be found or created in O(1) time. Furthermore, the most popular words (and their counts) can always be read off, in constant time per word, by traversing the count list backwards from the maximum.
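Using the sketch above, reading off the top k words looks something like this (written as a standalone function over the hypothetical `TopWords` object, purely for brevity):

```python
def top_k(counter, k):
    """The k most frequent (word, count) pairs, highest count first, found by
    walking the count list backwards from the tail (the maximum count)."""
    result = []
    node = counter.tail
    while node is not None and len(result) < k:
        for word in node.words:
            result.append((word, node.count))
            if len(result) == k:
                break
        node = node.prev
    return result
```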
In case that was not obvious, here is a rough ASCII art drawing of the data structure at a given point in time:

    Count list        word lists (each node points back to the count node)

        17        a <--> the <--> for
        ^
        |
        v
        12        Wilbur <--> drawing
        ^
        |
        v
        11        feature
Now, suppose we find a `Wilbur`. That will raise its count to 13; we can see from the fact that the successor of 12 is not 13 that the 13 count node needs to be created and inserted into the count-list. After we do that, we remove `Wilbur` from its current word-list, put it into the newly-created empty word-list associated with the new count node, and change the count-pointer in `Wilbur` to point to the new count node.
Then, suppose that a use of `drawing` expires, so its new count will be 11. We can see from the fact that the predecessor of 12 is 11 that no new count node needs to be created; we simply remove `drawing` from its word-list and attach it to the word-list associated with 11, fixing its count-pointer as we do so. Now we notice that the word-list associated with 12 is empty, so we can remove the 12 count node from the count-list and delete it.
When the count for a word reaches 0, rather than attaching it to the 0 count node, which doesn't exist, we just delete the word node. And if a new word is encountered, we just add the word to the 1 count node, creating that count node if it doesn't exist.
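Putting those rules together, the two adjustment operations might look roughly like this, continuing the earlier sketch (the `add`/`remove` names and the standalone-function style are mine, not part of the original description):

```python
def add(counter, word):
    """One new occurrence of `word`: move it up by exactly one count."""
    node = counter.node_for_word.get(word)
    if node is None:                                  # new word: it belongs in the 1 node
        if counter.head is None or counter.head.count != 1:
            counter._link_after(None, CountNode(1))   # create the 1 node at the head
        counter.head.words.add(word)
        counter.node_for_word[word] = counter.head
        return
    nxt = node.next
    if nxt is None or nxt.count != node.count + 1:    # successor isn't count+1: create it
        nxt = counter._link_after(node, CountNode(node.count + 1))
    node.words.discard(word)
    nxt.words.add(word)
    counter.node_for_word[word] = nxt
    if not node.words:                                # old count node is now empty
        counter._unlink(node)


def remove(counter, word):
    """One occurrence of `word` expired: move it down by exactly one count."""
    node = counter.node_for_word.get(word)
    if node is None:                                  # word already dropped (e.g. pruned)
        return
    node.words.discard(word)
    if node.count == 1:                               # count reaches 0: forget the word
        del counter.node_for_word[word]
    else:
        prv = node.prev
        if prv is None or prv.count != node.count - 1:  # predecessor isn't count-1: create it
            prv = counter._link_after(node.prev, CountNode(node.count - 1))
        prv.words.add(word)
        counter.node_for_word[word] = prv
    if not node.words:                                # old count node is now empty
        counter._unlink(node)
```

In use, every incoming word goes through `record` and `add`, every word yielded by `expired(now - 3600)` goes through `remove`, and `top_k` can be called at any time.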
In the worst case, every word has a unique count, so the size of the count-list cannot be greater than the number of unique words. Also, the total size of the word-lists is exactly the number of unique words because every word is in exactly one word-list, and fully-expired words don't appear in the word-lists at all.
--- EDIT
This algorithm is a bit RAM-hungry, but it really shouldn't have any trouble holding an hour's worth of tweets. Or even a day's worth. And the number of unique words is not going to change much after a few days, even considering abbreviations and misspellings. Even so, it's worth thinking about ways to reduce the memory footprint and/or make the algorithm parallel.
To reduce the memory footprint, the easiest thing is to just drop words which are still unique after a few minutes. This will dramatically cut down on the unique word count, without altering the counts of popular words. Indeed, you could prune a lot more drastically without altering the final result.
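One hypothetical way to do that pruning against the sketch above, assuming we additionally track a `first_seen` timestamp for every word (which is not part of the structure described earlier) and pick an arbitrary few-minute threshold:

```python
PRUNE_AGE = 5 * 60   # assumed threshold: "still unique after a few minutes"

def prune_singletons(counter, first_seen, now):
    """Drop words that still have count 1 after PRUNE_AGE seconds."""
    ones = counter.head                    # the smallest count lives at the head
    if ones is None or ones.count != 1:
        return
    for word in list(ones.words):          # copy: we mutate the set while iterating
        if now - first_seen[word] > PRUNE_AGE:
            ones.words.discard(word)
            del counter.node_for_word[word]
            first_seen.pop(word, None)
    if not ones.words:                     # the 1 node is now empty: drop it too
        counter._unlink(ones)
```

Pruned words may still have entries in the occurrence stream; `remove` above simply ignores words it no longer knows about, so nothing else needs to change.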
To run the algorithm in parallel, individual words can be allocated to different machines by using a hash function to generate a machine number. (Not the same hash function as the one used to construct the hash tables.) Then the top k words can be found by merging the top k words from each machine; the allocation by hash guarantees that the sets of words on different machines are disjoint.
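A rough sketch of that split, with an assumed machine count; `blake2b` is just a stand-in for any hash that is independent of the one used by the in-memory hash tables:

```python
import heapq
from hashlib import blake2b

def machine_for(word, num_machines):
    """Assign each word to one machine, independently of Python's dict hashing."""
    digest = blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_machines

def merged_top_k(per_machine_results, k):
    """Merge each machine's local top-k list of (word, count) pairs.

    Because words are partitioned by hash, no word appears on two machines,
    so the global top k is just the k largest counts across all the lists."""
    all_pairs = (pair for result in per_machine_results for pair in result)
    return heapq.nlargest(k, all_pairs, key=lambda pair: pair[1])
```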