3

I need to store top k most frequent elements in a stream. To estimate the frequency i use count-min-sketch algorithms. My stream is consist of keys(as string). So basically every time i encounter a new key in my stream, i calculate the frequency of current key so far by looking into the count-min-sketch data structure. However i am unable to store the top k most frequent keys.

My first idea was store them in a min-heap with fixed size of k. And i store [frequency, key] inside this min-heap with comparator comnpare frequency. So every time i get a frequency of a key, i try to see if the heap size is over k, if it is, then i compare frequency of current key with the top(smallest) frequency in min-heap, if my current key's frequency is larger then i pop the top, and insert my key into the heap.

However i realize the min-heap is not a set, which means it allows duplication. Let's say i have a very hot key, and i keep counting it in the stream, so every time i insert this [frequency, key] into the heap, eventually my heap will be full of the same key, only different frequencies.

So i am wondering is there a good way to store the top k distinct more frequent element in count-min-sketch.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
BYsBro
  • 65
  • 1
  • 3

2 Answers2

2

You can also maintain all three data-structures 1. Count-min sketch to store everything you encounter in the stream 2. Min heap of size k 3. Hash map of size k

In case of hot item - you increment the count and get the new frequency from count-min sketch, assuming this item already exists in min-heap, you get the item from hash-map and increase the frequency

When you encounter a different item whose frequency just increased and got itself in the prestigious min-heap, you can evict the root from min-heap and hash-map simultaneously so basically min-heap helps you to maintain top k-frequent items and hash-map to randomly access those frequent items. note that both min-heap and hash-map can map to the same memory address so updating frequency can only be done to the item stored in hash-map

Code Crack
  • 21
  • 2
  • How can you do that with Priority Queue of Java . there is no way to share the same memory address. Either you hve to build your own minHepa. – MrSham Sep 07 '21 at 08:16
1

Makes sense to maintain a hashmap of [key,<frequency,position>] pairs already in the heap. position refers to the index of the key within the heap (assuming array-based heap). When the key arrives, you check 2 conditions:
-the key is in the hashmap
-its frequency has changed

If both are true, you find the key within the heap in O(1) time since you already have its position stored in the hashmap, then modify the key's frequency and depending on whether the frequency increased or decreased, you do bubble-down or bubble-up from that position (O(logn)). After altering the position, you update the hashmap entry for that key with new frequency and position values.

If 1st one is false, you follow a usual logic, i.e. compare the key with root, if heap is full and the key's frequency is larger than the root's frequency then you pop the root out of heap and remove it from hashmap, and meanwhile insert the key into the heap and hashmap.

If key is in the hashmap but its frequency has not been changed you do nothing.

mangusta
  • 3,470
  • 5
  • 24
  • 47
  • 1
    thx, that's a good idea. Java's priority Queue give no index after operation. So i stuck with priorityQueue. But you remind me i can implement a binary-heap with index support myself. – BYsBro Nov 01 '18 at 07:43