1

I am in love with probabilistic data structures. For my current problem, it seems that the count-min-sketch structure is almost the right candidate. I want to use count-min-sketch to store events per ID.

Let's assume I do have the following

Map<String, Int> {
   [ID1, 10],
   [ID2, 12],
   [ID2, 15]
}

If I use a count-min-sketch, I can query the data structure by IDs and retrieve the ~counts.

Question

Actually I am interested in the average occurrence over all IDs, which in the example above would be: 12,33. If I am using the count-min then it seems that I need to store the Set of IDs and then iterate over the set and query the count-min for each ID and calculate the average. Is there an improved way without storing all IDs? Ideally I just want to retrieve the average right away without remembering all IDs.

Hope that make sense!?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
Fritz
  • 872
  • 1
  • 8
  • 17

1 Answers1

1

You should be able to calculate the average count if you know the number of entries, and the number of distinct entries:

averageCount = totalNumberOfEntries / numberOfDistinctEntries

Right? And to calculate the number of distinct entries, you can use e.g. HyperLogLog. You already added the hyperloglog tag to your question, so maybe you already know this?

Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132