I have an input stream of size n, and I want to produce an output stream of size k that contains distinct random elements of the input stream, without requiring any additional memory to track the elements already selected.
The algorithm I was going to use is basically as follows:
    for each element in input stream
        if random() < k/n
            decrement k
            output element
            if k = 0
                halt
            end if
        end if
        decrement n
    end for
The function random() generates a number uniformly at random from [0, 1), and I trust the algorithm's principle of operation is straightforward.
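In case it helps, here is the same procedure as runnable Python; sample_stream is just my name for it, and like the pseudocode it assumes the stream's length n is known up front:

    import random

    def sample_stream(stream, n, k):
        # Select each element with probability k/n, where k and n count
        # the selections and stream elements still remaining.
        for element in stream:
            if random.random() < k / n:
                yield element
                k -= 1
                if k == 0:
                    return  # early exit once the sample is complete
            n -= 1

    # Example: pick 5 distinct elements from a 100-element stream.
    print(list(sample_stream(iter(range(100)), 100, 5)))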
Although this algorithm can terminate early when it selects its last element, it is still O(n) in general, since it must scan the stream until the sample is complete. At first it seemed to work as intended (outputting elements that are random yet roughly uniformly distributed over the stream), but I suspect there may be a non-uniform tendency to pick later elements when k is much less than n. I'm not sure about this, however, so I'd appreciate knowing for sure one way or the other.

I'm also wondering if a faster algorithm exists. Obviously, since k elements must be output, no algorithm can be faster than O(k). For an O(k) solution, one could assume the existence of a function skip(x), which can skip over x elements of the input stream in O(1) time (but cannot skip backwards). I would still like to keep the requirement of not using any additional memory, however.
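For what it's worth, here is a sketch of the direction I have in mind for using skip(): instead of flipping a coin per element, draw the length of each gap between selections directly from its exact distribution, so only k random numbers and k skip() calls are needed. The skip(it, x) helper below is a hypothetical stand-in for the assumed O(1) primitive (on a plain Python iterator it is O(x)), and the sequential CDF inversion still costs O(gap) arithmetic per draw; Vitter's Algorithm D replaces that inner loop with rejection sampling to reach expected O(k) time overall.

    import random

    def gap_sample(it, n, k, skip):
        # P(gap = s) is the probability that the next s elements are all
        # rejected and the (s+1)-th is selected:
        #   P(gap = s) = k/(n-s) * prod_{j<s} (n-k-j)/(n-j)
        while k > 0:
            u = random.random()
            s, p = 0, k / n  # p = P(gap = 0)
            c = p            # running CDF
            while c <= u and s < n - k:
                p *= (n - k - s) / (n - s - 1)  # P(gap = s+1) from P(gap = s)
                s += 1
                c += p
            skip(it, s)      # jump over the s rejected elements
            yield next(it)   # select the element just after the gap
            n -= s + 1
            k -= 1

    # Hypothetical stand-in for the assumed O(1) skip primitive.
    def skip(it, x):
        for _ in range(x):
            next(it)

    print(list(gap_sample(iter(range(100)), 100, 5, skip)))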