0

So the whole point of Count-Min Sketch is to update certain counters depending on the results of the provided hash functions. But, these counters are limited in memory, and after running for quite some time, they might overflow, dropping from the MAX value to the MIN value (like Integers do). Assuming all I need is the N most frequent values in the sketch, is there a way to avoid this other than restarting the sketch every once in a while?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
shakedzy
  • 2,853
  • 5
  • 32
  • 62

1 Answers1

1

Use an appropriately sized integer if this worries you.

An 8-byte (long long) unsigned integer has a maximum value of 18,446,744,073,709,551,615. That should probably be enough.

Edit

Assuming all I need is the N most frequent values in the sketch, is there a way to avoid this other than restarting the sketch every once in a while?

Perhaps you could adapt reservoir sampling to your needs.

Richard
  • 56,349
  • 34
  • 180
  • 251
  • Thanks, but let's assume I have very limited memory - is there another method? – shakedzy Mar 07 '18 at 19:58
  • 1
    Edited my answer: reservoir sampling might do what you want. But an extra byte or two would give you an elegant, simple solution; you'd have to be very byte-strapped to ignore that possibility. – Richard Mar 07 '18 at 20:02
  • The link you posted refers to a question about random sampling - I need the most frequent, unless I'm missing something here – shakedzy Mar 07 '18 at 20:10
  • 1
    @shakedzy: The most frequent items are most likely to appear in that sample. Given the number of items seen and the number of distinct items in the set, you could figure out the probability of an item appearing in that sample. – Richard Mar 07 '18 at 20:22