0

I have a stream of int values arriving at a certain rate. Every 5 minutes, I'd like to compute some percentiles from the values, and start over.

The problem: I don't want to waste too much memory, so I'd like to keep only a few KBs for the values. If my buffer does not fill during 5 minutes, I can compute the percentiles perfectly. If the buffer does fill however, I'd like to start dropping some values (possibly using reservoir sampling and random eviction as suggested here - Percentiles of Live Data Capture). Unfortunately I can't find a solution that works well in both scenarios - if the buffer is not full, I don't want to evict or ignore values, and once it gets full and I start evicting I invariably introduce bias.

Community
  • 1
  • 1
user1424934
  • 153
  • 2
  • 6
  • What is the size of the buffer? – vidit Oct 26 '13 at 00:54
  • The size is configurable. Right now I have 10,000 ints = 40KB. I _could_ make it bigger but since I don't have any way of knowing how many values will be arriving - that might change considerably over time - every size I'll pick might not be enough. And simply throwing 10MB at it is too wasteful. – user1424934 Oct 26 '13 at 01:49

1 Answers1

0

OK I think I figured it out - I can use Algorithm R to uniformly select a fixed sized subset of the incoming elements. Then I can compute the percentiles from this subset.

user1424934
  • 153
  • 2
  • 6