
I have an input stream of size n, and I want to produce an output stream of size k that contains distinct random elements of the input stream, without requiring any additional memory for the elements selected by the sample.

The algorithm I was going to use is basically as follows:

for each element in input stream
    if random()<k/n
        decrement k
        output element
        if k = 0
            halt
        end if
    end if
    decrement n
end for

The function random() generates a number uniformly at random from [0..1), and I trust the algorithm's principle of operation is straightforward.
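
For concreteness, here is a minimal runnable sketch of the above in Python (the name sample_stream and the assumption that the stream is an iterable of known length n are just for illustration):

import random

def sample_stream(stream, n, k):
    """Yield k distinct elements of a stream of known length n, in stream order."""
    need = k    # elements still to be picked
    left = n    # stream elements not yet examined (including the current one)
    for element in stream:
        # Pick the current element with probability need/left.
        if random.random() < need / left:
            yield element
            need -= 1
            if need == 0:
                return
        left -= 1

For example, list(sample_stream(range(100), 100, 5)) returns five distinct values in increasing (stream) order.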

Although this algorithm can terminate early once it has selected the last of the k elements, in general it is still O(n). At first it seemed to work as intended (outputting random elements roughly uniformly distributed over the input stream), but I think there may be a non-uniform tendency to pick later elements when k is much less than n. I'm not sure about this, however, so I'd appreciate knowing for sure one way or the other.

I'm also wondering if a faster algorithm exists. Obviously, since k elements must be output, the algorithm cannot be any faster than O(k). For an O(k) solution, one could assume the existence of a function skip(x), which can skip over x elements in the input stream in O(1) time (but cannot skip backwards). I would still like to keep the requirement of not requiring any additional memory, however.

markt1964
  • Just selecting *one* random element from a length-n stream is `O(n)`. – internet_user Apr 27 '18 at 15:06
  • You could use a [Monte Carlo analysis](https://en.wikipedia.org/wiki/Monte_Carlo_method) to reassure yourself that the algorithm is working properly. – Mark Ransom Apr 27 '18 at 18:16
  • See the second part of [this answer](https://stackoverflow.com/a/26478943/1566221) and specifically the link to a paper by Jeffrey Scott Vitter (which apparently still works). – rici Apr 27 '18 at 19:15
  • Implementation matters. Make sure you're generating randoms with enough precision. `r – Gene Apr 28 '18 at 01:58

1 Answer


If it is a real stream, you need O(n) time to scan it.

Your existing algorithm is good. (I got that wrong before.) You can prove by induction that the probability that you have not picked the first element in the first i tries is 1 - i/n = (n-i)/n. This is true for i = 0 by inspection. Now suppose you have not picked it in the first i tries; the probability that the next try picks it is 1/(n-i). So the probability of picking it on the (i+1)'th try is ((n-i)/n) * (1/(n-i)) = 1/n, which means that the probability of not having picked it in the first i+1 tries is 1 - i/n - 1/n = 1 - (i+1)/n. That completes the induction. And so the probability of picking the first element within the first k tries is the complement of never picking it, namely 1 - (n-k)/n = k/n.
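
If you want further empirical reassurance, a quick Monte Carlo tally (as suggested in the comments) should show every position being picked about k/n of the time. A minimal sketch, where selection_sample is just a restatement of your algorithm over the indices 0..n-1:

import random
from collections import Counter

def selection_sample(n, k):
    """Return the indices 0..n-1 selected by one run of the algorithm."""
    picked = []
    need, left = k, n
    for i in range(n):
        if random.random() < need / left:
            picked.append(i)
            need -= 1
            if need == 0:
                break
        left -= 1
    return picked

# If the selection is uniform, every index should be chosen about trials*k/n times.
trials, n, k = 100_000, 20, 3
counts = Counter(i for _ in range(trials) for i in selection_sample(n, k))
print(sorted(counts.items()))   # expect each count to be near 15,000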

But what if you have O(1) access to any element? Well note that choosing k to take is the same as choosing n-k to leave. So without loss of generality we can assume that k <= n/2. What that means is that we can use a randomized algorithm like this:

import random

chosen = set()                       # elements already selected: O(k) extra space
while len(chosen) < k:
    # Assumes O(1) random access: pick a uniformly random position in the stream.
    choice = stream[random.randrange(n)]
    if choice not in chosen:
        chosen.add(choice)

The set takes O(k) space, and since k <= n/2, the probability that each random choice is new to you is at least 0.5, so the expected running time is no worse than 2k choices.

btilly
  • I was asking for a solution that takes O(1) space, actually. Although you raise a good point about needing O(n) to scan the input stream. What if a second function existed to skip over elements on the stream in O(1), however? For my purposes, such a seek-like function could not scan backwards, however... only forwards... and one could assume that each element in the stream takes up the same space, so we could arbitrarily seek to any one beyond the current. I have adjusted the question to accommodate this. – markt1964 Apr 27 '18 at 16:42
  • @markt1964 you can simply spin and generate random numbers until you get a hit, then seek the stream to skip the count of numbers you had to generate. – Mark Ransom Apr 27 '18 at 16:51
  • Additionally, it seems that the probability calculation you are using is wrong. If one had 7 elements, and needed to pick 6 of them, by your calculation, the probability of picking the first element is 1/7+1/6+1/5+1/4+1/3+1/2, which is 1.59, and makes no sense as a probability score. – markt1964 Apr 27 '18 at 17:06
  • @MarkRansom If by "until you get a hit", you mean until I get one I haven't got before, that would entail knowing which ones I had before, which would take space. – markt1964 Apr 27 '18 at 17:14
  • @markt1964 no, by "get a hit" I mean a positive result for `if random() – Mark Ransom Apr 27 '18 at 18:04
  • @MarkRansom Okay... that's still O(n), however... because I'm needing to still generate one random number for every entry in the stream... the fact that I'm not reading them in doesn't change the complexity. – markt1964 Apr 27 '18 at 18:08
  • @markt1964 You are right about the probability score. `k/n` is in fact right. The first number picked has probability `1/n` of being any particular element. The second has probability `1/(n-1)` in the `1 - 1/n = (n-1)/n` probability case that the element wasn't picked first, for an overall probability of `((n-1)/n) * (1/(n-1)) = 1/n`. The third has probability `1/(n-2)` in the `1 - 1/n - 1/n = (n-2)/n` case that it wasn't one of the first two picked, for a probability of `((n-2)/n) * (1/(n-2)) = 1/n`. And so on. – btilly Apr 27 '18 at 18:25