
I am reading about the Count-Min Sketch data structure, which gives probabilistic answers to point and range queries based on an error-probability parameter and a tolerance parameter. For example, the question "how many times did item x appear in the stream of data?" can be answered by CM, with the estimate correct to within the tolerance except with some small probability (e.g. 10%).
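For concreteness, here is my mental model of the structure in code (a minimal sketch, assuming the usual parametrization of w = ceil(e/eps) columns and d = ceil(ln(1/delta)) rows; the class and all names are my own):

```python
import math
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: point queries overestimate the true count
    by at most eps * N with probability at least 1 - delta."""
    def __init__(self, eps, delta):
        self.w = math.ceil(math.e / eps)         # tolerance parameter eps
        self.d = math.ceil(math.log(1 / delta))  # error-probability parameter delta
        # One random seed per row, standing in for d independent hash functions.
        self.seeds = [random.randrange(1 << 31) for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _col(self, row, x):
        return hash((self.seeds[row], x)) % self.w

    def update(self, x, count=1):
        for r in range(self.d):
            self.table[r][self._col(r, x)] += count

    def query(self, x):
        # Take the minimum over all rows; collisions can only inflate counts.
        return min(self.table[r][self._col(r, x)] for r in range(self.d))
```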

A related problem, heavy hitters, has also come up. While implementing a min-heap for the HH problem, I noticed that various research papers specify that we insert an item into the heap only if its minimum (i.e. estimated) count in the sketch exceeds a threshold.
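In code, my understanding of that rule is roughly the following (a sketch under my own assumptions: `phi` is the heavy-hitter fraction, so the threshold is `phi * n`, and `cms` is a Count-Min sketch like the one above):

```python
import heapq

def heavy_hitter_candidates(stream, cms, phi):
    """Track items whose estimated count exceeds phi * n, via a min-heap."""
    n = 0
    heap = []            # min-heap of (estimate-at-insert, item)
    candidates = set()
    for x in stream:
        n += 1
        cms.update(x)
        est = cms.query(x)
        # The rule from the papers: insert into the heap only if the
        # estimated count clears the threshold.
        if est >= phi * n and x not in candidates:
            candidates.add(x)
            heapq.heappush(heap, (est, x))
        # Lazily evict entries whose stored estimate fell below the
        # (growing) threshold, re-checking against a fresh estimate.
        while heap and heap[0][0] < phi * n:
            _, y = heapq.heappop(heap)
            fresh = cms.query(y)
            if fresh >= phi * n:
                heapq.heappush(heap, (fresh, y))  # still heavy: refresh entry
                break
            candidates.discard(y)
    return [y for y in candidates if cms.query(y) >= phi * n]
```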

My question is: does this mean we are answering the heavy hitters problem probabilistically? Would the corresponding question be "with probability of 10%, which item was the second most frequent in the stream of data?"

user3508140
    Usually, we answer the epsilon-heavy-hitters problem, which is a slightly relaxed version of the original problem, and since the frequency estimates can have errors from hash collisions, you can consider it "probabilistic". If you are curious about this class of data structures and want to learn more about the Count-Min sketch, take a look at my recently published book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 978-3748190486). https://pdsa.gakhov.com – gakhov Mar 08 '19 at 20:55

1 Answer


From Wikipedia:

In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant c > 1, let the length of the stream be m, and let f_i denote the frequency of value i in the stream. The frequent elements problem is to output the set { i | f_i > m/c }.

Some notable algorithms are:

  • Boyer–Moore majority vote algorithm
  • Karp-Papadimitriou-Shenker algorithm
  • Count-Min sketch
  • Sticky sampling
  • Lossy counting
  • Sample and Hold
  • Multi-stage Bloom filters
  • Count-sketch
  • Sketch-guided sampling

Event detection

Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.
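For the event-detection step, the comparison between two consecutive time windows amounts to something like this (a toy sketch; `prev_counts` and `curr_counts` are assumed to be item-to-estimate dicts produced by per-window heavy-hitter runs):

```python
def trending_item(prev_counts, curr_counts):
    # Report the item whose estimated count grew the most since the
    # previous time point; EWMA normalization would refine this.
    return max(curr_counts,
               key=lambda x: curr_counts[x] - prev_counts.get(x, 0),
               default=None)
```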

So, yes. A CMS can be used to estimate frequencies (approximately), and those estimates can in turn be used to answer the HH question. Since the estimates only come with probabilistic guarantees, the answer to the HH question is probabilistic as well.
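As an illustration (a sketch, not the canonical one-pass method: it reuses a `CountMinSketch` class like the one in your question and takes two passes for simplicity; a streaming version would maintain the heap/threshold structure you describe):

```python
def frequent_elements(stream, eps, delta, c):
    """Approximate the set { i : f_i > m/c } with a Count-Min sketch."""
    items = list(stream)
    m = len(items)
    cms = CountMinSketch(eps, delta)
    for x in items:
        cms.update(x)
    # Each estimate overshoots the true frequency by at most eps * m with
    # probability >= 1 - delta, so true heavy hitters are never missed,
    # though near-threshold items may be falsely reported.
    return {x for x in set(items) if cms.query(x) > m / c}
```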

Joris Schellekens