0

Is there an algorithm such as Reservoir Sampling (algorithm that randomly chooses an item from a stream such that each item is equally likely to be selected), but once an item is selected it is yielded (and therefore cannot be overridden)? i.e, not only can't I hold the stream in memory, I can't hold the sample in memory either. I need a Python solution, but an algorithm name is enough.

The code I currently have (taken from https://towardsdatascience.com/the-5-sampling-algorithms-every-data-scientist-need-to-know-43c7bc11d17c):

k=5
reservoir = []
for i, element in enumerate(stream):
    if i+1<= k:
        reservoir.append(element)
    else:
        probability = k/(i+1)
        if random.random() < probability:
            # Select item in stream and remove one of the k items already selected
             reservoir[random.choice(range(0,k))] = element # This is why I can't currently do it - elements in the list are overridden
Stack Overflow
  • 377
  • 4
  • 16
  • So you want to return a random item with uniform probability from a stream you can only iterate through once? Do you know the length of the stream in advance then? – Berthur Nov 14 '22 at 12:26
  • 2
    You can't do that without knowing the length of the stream in advance. The probability that any particular item will be in the output has to decrease as the stream length increases. – Matt Timmermans Nov 14 '22 at 12:38
  • 2
    Can you elaborate on why you can’t hold items in memory / why you can’t replace them? That’s a hard restriction to work around. – templatetypedef Nov 14 '22 at 15:19
  • It sounds like what you want is to sample a stream to turn it into a smaller stream, which presumably some downstream process will handle. If correct, this is easy. Just set your probability p (which will also be your scaling factor), and sample whenever rand() < p. – Dave Nov 14 '22 at 22:26

0 Answers0