
I need to implement Weighted Reservoir Sampling. I have referred to the paper mentioned in this blog. I want to write test cases for unit testing my implementation, and I am confused about how to calculate the expected probability of each element ending up in the reservoir.

I thought it should be proportional to (weight_of_element/weight_of_all_elements), but the test-case mentioned here computes it differently. How should I do it?

hippietrail
Amit
  • The expected probability for every element should be equal, correct? What is the problem you are running into? Are you confused about the sampling process? – eigenchris Mar 11 '15 at 20:01
  • @eigenchris this is not "reservoir sampling" but "weighted reservoir sampling", so it should be proportional to the weight of each element. But in the test-cases, the expected values don't correspond to that. – Amit Mar 12 '15 at 02:28

1 Answer


In order to write a test-case, you can indeed estimate the probability of the element being chosen. Suppose you've assigned weights like this:
weights = [1, 5, 8, 2, 5]

Now you are doing Weighted Reservoir Sampling in order to draw one element. What is the probability of each element appearing in the result? It is exactly (weight_of_element/weight_of_all_elements):
prob = [0.048, 0.238, 0.381, 0.095, 0.238]
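
These figures are just each weight divided by the total (21 here), rounded to three decimals; a quick sanity check in Python:

```python
weights = [1, 5, 8, 2, 5]
total = sum(weights)  # 21

# Each element's probability of being drawn is its weight over the total.
prob = [round(w / total, 3) for w in weights]
# prob == [0.048, 0.238, 0.381, 0.095, 0.238]
```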

In other words, if you draw one element repeatedly 10^6 times, you should have 0.381 * 10^6 instances of the third element, 0.048 * 10^6 instances of the first element, and so on. Approximately, of course.

Therefore, you can look at the percentage of times the first element appeared out of 10^6 trials. This must be approximately (weight_of_first_element/weight_of_all_elements). Compare those values and see if they are close to each other.

So the test-case may look like this (pseudo-code):

numTrials = 1000000
histogram = map<int, int>  # element index -> number of times drawn
for t = 1..numTrials:
  element = WeightedReservoir.sample(weights, 1) # draw one element (its index)
  histogram[element]++
for i = 1..len(weights):
  expected_probability = weights[i] / sum(weights)
  observed_probability = histogram[i] / numTrials
  assert(abs(expected_probability - observed_probability) <= epsilon) # measuring absolute difference, but you can switch to relative difference
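
The pseudo-code above can be made runnable in Python. Note that `random.choices` here is only a stand-in for your `WeightedReservoir.sample` (an assumption on my part); swap in your own sampler when testing:

```python
import random
from collections import Counter

def sample_one(weights, rng):
    # Stand-in for WeightedReservoir.sample(weights, 1):
    # draws one index with probability proportional to its weight.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

weights = [1, 5, 8, 2, 5]
num_trials = 1_000_000
rng = random.Random(42)  # seeded only so the test run is reproducible

# Count how often each index is drawn.
histogram = Counter(sample_one(weights, rng) for _ in range(num_trials))

total = sum(weights)
epsilon = 0.005  # tolerance on the absolute difference
for i, w in enumerate(weights):
    expected = w / total
    observed = histogram[i] / num_trials
    assert abs(expected - observed) <= epsilon
```

With 10^6 trials, the standard deviation of each observed frequency is below 0.0005, so a tolerance of 0.005 leaves ample slack while still catching a broken sampler.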

See this thread for a specific implementation in Java.

As for the Cloudera test you've pointed out, it follows a different logic (I've seen the same approach in the NumPy tests for numpy.random.choice as well):

  1. The sampling routine is probabilistic in its nature, i.e. non-deterministic;
  2. Let's take a fixed seed value for random number generation. Embed this value into test-case. Now it is fully deterministic: invoking the test-case several times yields the same result;
  3. Since the result is deterministic, we can obtain it manually (for small inputs). Embed the expected result into test.

If you cannot control the seed value, this method is not for you.
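
A minimal sketch of that fixed-seed style in Python, with a toy cumulative-weight sampler standing in for a real implementation (both the sampler and the seed value here are illustrative, not taken from the question):

```python
import random

def weighted_choice(weights, rng):
    # Toy sampler used only to illustrate the fixed-seed style of test;
    # in a real test you'd call your own implementation with the seed plugged in.
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1

rng = random.Random(12345)  # fixed seed -> fully deterministic run
result = [weighted_choice([1, 5, 8, 2, 5], rng) for _ in range(5)]
# With the seed fixed, `result` is identical on every run, so it can be
# compared against a hard-coded expected list inside the test.
```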

Mike Koltsov