1

I am having trouble understanding the probabilities involved in reservoir sampling. Below is the sample code which I have seen used in almost everywhere:

    1/*
    2  S has items to sample, R will contain the result, K number of items to select
    3*/
    4ReservoirSample(S[1..n], R[1..k])
    5  // fill the reservoir array
    6  for i = 1 to k
    7      R[i] := S[i]
    8
    9 // replace elements with gradually decreasing probability
    10  for i = k+1 to n
    11    j := random(1, i)   // important: inclusive range
    12    if j <= k
    13        R[j] := S[i]

Is my understanding right(?): Suppose we have k=3 and input = [100, 200, 300, 400, 500] and i is currently at 500 index. Probability of 500 replacing 300 in the reservoir(which is 3 in size) = probability of 300 being selected in the reservoir * probability of 500 being selected which is only possible if the index returned by random function is less than or equal to 3 out of 5 choices = 1/3 * 3/5 = 1/5

hippietrail
  • 15,848
  • 18
  • 99
  • 158
noman pouigt
  • 906
  • 11
  • 25

1 Answers1

-1

Your understanding doesn't seem to be correct. The probability of 500 getting selected is nowhere related to the selection of 300.

You should check out the gfg link for the same. "How does this work?" section, which states the following:

Case 1: For last n-k stream items, i.e., for stream[i] where k <= i < n For every such stream item stream[i], we pick a random index from 0 to i and if the picked index is one of the first k indexes, we replace the element at picked index with stream[i]

To simplify the proof, let us first consider the last item. The probability that the last item is in final reservoir = The probability that one of the first k indexes is picked for last item = k/n (the probability of picking one of the k items from a list of size n)

Let us now consider the second last item. The probability that the second last item is in final reservoir[] = [Probability that one of the first k indexes is picked in iteration for stream[n-2]] X [Probability that the index picked in iteration for stream[n-1] is not same as index picked for stream[n-2] ] = [k/(n-1)]*[(n-1)/n] = k/n.

Similarly, we can consider other items for all stream items from stream[n-1] to stream[k] and generalize the proof.

Case 2: For first k stream items, i.e., for stream[i] where 0 <= i < k The first k items are initially copied to reservoir[] and may be removed later in iterations for stream[k] to stream[n]. The probability that an item from stream[0..k-1] is in final array = Probability that the item is not picked when items stream[k], stream[k+1], …. stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n

Swati Garg
  • 995
  • 1
  • 10
  • 21