1

To make clear following is the question:

Given an input stream of an indeterminate length, how do you return a random member of that stream (with equal probability for each), given that you're not allowed to store more than a constant number of inputs, and you can only go through the inputs once

The solution to this problem seems to be Reservoir Sampling and it states below. "First, you want to make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. That way if you have exactly 1,000 elements, the algorithm works. This is the base case.

Next, you want to process the i'th element (starting with i = 1,001) such that at the end of processing that step, the 1,000 elements in your reservoir are randomly sampled amongst the i elements you've seen so far. How can you do this? Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001."

I am unable to understand the last sentence "The answer is easy: 1,000/1,001". Shouldn't be the probability of finding 1 element in array of 1001 elements be 1/1001 and not 1000/1001 ? Isn't the Sample space equal to 1001 and favorable number of outcomes equal to 1 ?

hippietrail
  • 15,848
  • 18
  • 99
  • 158

2 Answers2

1

There are 1,001 elements. 1,000 of them are in the sample. One is outside of the sample. So the probability that a particular element is the one which is outside is 1 out of 1,001, and the probability that it is one of the thousand elements inside the sample is 1,000 out of 1,001.

rici
  • 234,347
  • 28
  • 237
  • 341
0

I find the following argument more lucid. Let S be the set of the first 1000 elements; let e denote the last element in the stream (e.g. the 1001st one). There are {1001 choose 1000}=1001 possible size-1000 subsets of a set of 1001 elements, and you want all of them to have the same probability of being stored in the data structure (this invariant should hold each time a new element arrives).

What is the number of size-1000 subsets of 1001 elements that contain e? Well, since e is fixed we have 1000 elements left to choose from, and we'll choose 999 elements, thus there are {1000 choose 999} = 1000 such subsets.

The probability that e is in S should hence be: {1000 choose 999} / {1001 choose 1000} = 1000/1001 (that is the number of size-1000 subsets that contain e divided by the number of all size-1000 subsets).

By {n choose k} I denote the binomial coefficient.

blazs
  • 4,705
  • 24
  • 38