If the words are not weighted (other than weights 0 and 1), then it is possible to derive a simple data structure which maintains the word counts in order, using O(N) auxiliary storage where N is the number of unique words encountered in the sliding window (one hour, in the example). All operations (add a word, expire a word, look up the most frequent word) can be performed in O(1) time. Since any accurate solution needs to retain all the unique words in the sliding window, this solution is not asymptotically worse, although the constant factor per word is not small.
The key to the solution is that the count for any given word can only be incremented or decremented by 1, and that all of the counts are integers. Consequently, it is possible to maintain a doubly-linked list of counts (in order) where each node in the list points to a doubly-linked list of words which have that count. In addition, each node in the word-list points back to the appropriate count node. Finally, we maintain a hashmap which allows us to find the node corresponding to a given word.
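For concreteness, here is a minimal Python sketch of that structure. The doubly-linked word-lists are modelled as Python sets (which also give O(1) insert and remove), the names (`CountNode`, `TopWords`, `node_for_word`) are purely illustrative, and the splice/unlink helpers are included because the operations described below need them:

```python
class CountNode:
    """One node of the doubly-linked count list (sketch only)."""
    __slots__ = ("count", "words", "prev", "next")

    def __init__(self, count):
        self.count = count
        self.words = set()   # words whose current count is exactly `count`
        self.prev = None     # neighbour holding the next smaller count
        self.next = None     # neighbour holding the next larger count


class TopWords:
    """The count list plus the hashmap from each word to its count node."""

    def __init__(self):
        self.node_for_word = {}  # word -> CountNode currently holding it
        self.head = None         # count node with the smallest count in use
        self.tail = None         # count node with the largest count in use

    def _link_after(self, prev_node, new_node):
        """Splice new_node into the count list just after prev_node
        (prev_node=None means 'insert at the head')."""
        new_node.prev = prev_node
        new_node.next = self.head if prev_node is None else prev_node.next
        if new_node.prev is None:
            self.head = new_node
        else:
            new_node.prev.next = new_node
        if new_node.next is None:
            self.tail = new_node
        else:
            new_node.next.prev = new_node
        return new_node

    def _unlink(self, node):
        """Remove an (empty) count node from the count list."""
        if node.prev is None:
            self.head = node.next
        else:
            node.prev.next = node.next
        if node.next is None:
            self.tail = node.prev
        else:
            node.next.prev = node.prev
```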
In order to decay the words at the end of their life, we also need to retain the entire data stream from the sliding window, which has a size of O(N') where N' is the total number of words encountered during the sliding window. This can be stored as a singly-linked list where each node has a timestamp and a pointer to the unique word in the word-list.
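Here is a sketch of that stream, with a `collections.deque` standing in for the singly-linked list (and storing the word itself rather than a pointer into the word-list); the names `record` and `expired` are assumptions, not part of the structure described above:

```python
import time
from collections import deque

# One entry per word occurrence in the window, oldest first.
occurrences = deque()   # holds (timestamp, word) pairs

def record(word, now=None):
    """Append one occurrence of `word` to the stream."""
    occurrences.append((time.time() if now is None else now, word))

def expired(cutoff):
    """Pop and yield every word whose occurrence has aged out of the window;
    the caller then decrements that word's count (see the sketch further down)."""
    while occurrences and occurrences[0][0] < cutoff:
        _, word = occurrences.popleft()
        yield word
```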
When a word is encountered or expired, its count needs to be adjusted. Since the count can only be incremented or decremented by 1, the adjustment always consists of moving the word to the adjacent count-node (which may or may not exist); since the count-nodes are stored in a sorted linked list, the adjacent node can be found or created in O(1) time. Furthermore, the most popular words (and their counts) can always be read off, in constant time per word, by traversing the count list backwards from the maximum.
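Using the sketch above, reading off the top k words looks something like this (written as a standalone function over the hypothetical `TopWords` object, purely for brevity):

```python
def top_k(counter, k):
    """The k most frequent (word, count) pairs, highest count first, found by
    walking the count list backwards from the tail (the maximum count)."""
    result = []
    node = counter.tail
    while node is not None and len(result) < k:
        for word in node.words:
            result.append((word, node.count))
            if len(result) == k:
                break
        node = node.prev
    return result
```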
In case that was not obvious, here is a rough ASCII art drawing of the data structure at a given point in time:

    Count list        word lists (each node points back to the count node)

        17        a <--> the <--> for
        ^
        |
        v
        12        Wilbur <--> drawing
        ^
        |
        v
        11        feature
Now, suppose we find a `Wilbur`. That will raise its count to 13; we can see from the fact that the successor of 12 is not 13 that the 13 count node needs to be created and inserted into the count-list. After we do that, we remove `Wilbur` from its current word-list, put it into the newly-created empty word-list associated with the new count node, and change the count-pointer in `Wilbur` to point to the new count node.
Then, suppose that a use of `drawing` expires, so its new count will be 11. We can see from the fact that the predecessor of 12 is 11 that no new count node needs to be created; we simply remove `drawing` from its word-list and attach it to the word-list associated with 11, fixing its count-pointer as we do so. Now we notice that the word-list associated with 12 is empty, so we can remove the 12 count node from the count-list and delete it.
When the count for a word reaches 0, rather than attaching it to the 0 count node, which doesn't exist, we just delete the word node. And if a new word is encountered, we just add the word to the 1 count node, creating that count node if it doesn't exist.
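Putting those rules together, the two adjustment operations might look roughly like this, continuing the earlier sketch (the `add`/`remove` names and the standalone-function style are mine, not part of the original description):

```python
def add(counter, word):
    """One new occurrence of `word`: move it up by exactly one count."""
    node = counter.node_for_word.get(word)
    if node is None:                                  # new word: it belongs in the 1 node
        if counter.head is None or counter.head.count != 1:
            counter._link_after(None, CountNode(1))   # create the 1 node at the head
        counter.head.words.add(word)
        counter.node_for_word[word] = counter.head
        return
    nxt = node.next
    if nxt is None or nxt.count != node.count + 1:    # successor isn't count+1: create it
        nxt = counter._link_after(node, CountNode(node.count + 1))
    node.words.discard(word)
    nxt.words.add(word)
    counter.node_for_word[word] = nxt
    if not node.words:                                # old count node is now empty
        counter._unlink(node)


def remove(counter, word):
    """One occurrence of `word` expired: move it down by exactly one count."""
    node = counter.node_for_word.get(word)
    if node is None:                                  # word already dropped (e.g. pruned)
        return
    node.words.discard(word)
    if node.count == 1:                               # count reaches 0: forget the word
        del counter.node_for_word[word]
    else:
        prv = node.prev
        if prv is None or prv.count != node.count - 1:  # predecessor isn't count-1: create it
            prv = counter._link_after(node.prev, CountNode(node.count - 1))
        prv.words.add(word)
        counter.node_for_word[word] = prv
    if not node.words:                                # old count node is now empty
        counter._unlink(node)
```

In use, every incoming word goes through `record` and `add`, every word yielded by `expired(now - 3600)` goes through `remove`, and `top_k` can be called at any time.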
In the worst case, every word has a unique count, so the size of the count-list cannot be greater than the number of unique words. Also, the total size of the word-lists is exactly the number of unique words because every word is in exactly one word-list, and fully-expired words don't appear in the word-lists at all.
--- EDIT
This algorithm is a bit RAM-hungry, but it really shouldn't have any trouble holding an hour's worth of tweets. Or even a day's worth. And the number of unique words is not going to change much after a few days, even considering abbreviations and misspellings. Even so, it's worth thinking about ways to reduce the memory footprint and/or make the algorithm parallel.
To reduce the memory footprint, the easiest thing is to just drop words which are still unique after a few minutes. This will dramatically cut down on the unique word count, without altering the counts of popular words. Indeed, you could prune a lot more drastically without altering the final result.
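One hypothetical way to do that pruning against the sketch above, assuming we additionally track a `first_seen` timestamp for every word (which is not part of the structure described earlier) and pick an arbitrary few-minute threshold:

```python
PRUNE_AGE = 5 * 60   # assumed threshold: "still unique after a few minutes"

def prune_singletons(counter, first_seen, now):
    """Drop words that still have count 1 after PRUNE_AGE seconds."""
    ones = counter.head                    # the smallest count lives at the head
    if ones is None or ones.count != 1:
        return
    for word in list(ones.words):          # copy: we mutate the set while iterating
        if now - first_seen[word] > PRUNE_AGE:
            ones.words.discard(word)
            del counter.node_for_word[word]
            first_seen.pop(word, None)
    if not ones.words:                     # the 1 node is now empty: drop it too
        counter._unlink(ones)
```

Pruned words may still have entries in the occurrence stream; `remove` above simply ignores words it no longer knows about, so nothing else needs to change.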
To run the algorithm in parallel, individual words can be allocated to different machines by using a hash function to generate a machine number. (Not the same hash function as the one used to construct the hash tables.) Then the top k words can be found by merging the top k words from each machine; the allocation by hash guarantees that the sets of words on different machines are disjoint.
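A rough sketch of that split, with an assumed machine count; `blake2b` is just a stand-in for any hash that is independent of the one used by the in-memory hash tables:

```python
import heapq
from hashlib import blake2b

def machine_for(word, num_machines):
    """Assign each word to one machine, independently of Python's dict hashing."""
    digest = blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_machines

def merged_top_k(per_machine_results, k):
    """Merge each machine's local top-k list of (word, count) pairs.

    Because words are partitioned by hash, no word appears on two machines,
    so the global top k is just the k largest counts across all the lists."""
    all_pairs = (pair for result in per_machine_results for pair in result)
    return heapq.nlargest(k, all_pairs, key=lambda pair: pair[1])
```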