38

To retrieve k random numbers from an array of undetermined size we use a technique called reservoir sampling. Can anybody briefly highlight how it happens with a sample code?

hippietrail
  • 15,848
  • 18
  • 99
  • 158
Jony
  • 6,694
  • 20
  • 61
  • 71
  • Related question http://stackoverflow.com/questions/54059/efficiently-selecting-a-set-of-random-elements-from-a-linked-list – Ian Mercer Apr 11 '10 at 06:11
  • [This page](http://blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx) contains a good explanation with pseudo-code. (The Wikipedia page I originally linked to is unclear, and the pseudo-code is incomplete.) – Stephen C Apr 10 '10 at 07:57
  • 1
    I wrote a blog entry about the exact thing a couple of months back, which has a C# implementation: http://gregbeech.com/blog/sampling-very-large-sequences The best description of how it works that I've found is here: http://gregable.com/2007/10/reservoir-sampling.html – Greg Beech Apr 10 '10 at 08:13

4 Answers4

46

I actually did not realize there was a name for this, so I proved and implemented this from scratch:

def random_subset(iterator, K):
    result = []
    N = 0

    for item in iterator:
        N += 1
        if len(result) < K:
            result.append(item)
        else:
            s = int(random.random() * N)
            if s < K:
                result[s] = item

    return result

From: http://web.archive.org/web/20141026071430/http://propersubset.com:80/2010/04/choosing-random-elements.html

With proof near the end.

funnydman
  • 9,083
  • 4
  • 40
  • 55
Larry
  • 4,491
  • 2
  • 24
  • 16
  • @Larry: Where is the `accept it with probability s/k` part in your code? [ quote from algorithm mentioned at http://blogs.msdn.com/spt/archive/2008/02/05/reservoir-sampling.aspx ] – Lazer Apr 11 '10 at 10:26
  • By coincidence, it seems like between that article and mine, we use the same variables, but for different things. My "K" appears to be their "S", and my "N" (in code) appears to be their "K". In other words, I accept things with `K/N` probability, where N is the current count of things. – Larry Apr 11 '10 at 16:14
  • what I meant to ask was how you were implementing probability in your code. Anyways, now I understand. Thanks! – Lazer Apr 12 '10 at 03:11
  • Not quite understanding your question, but I can explain the code a little more if you are specific! =) – Larry Apr 12 '10 at 04:31
  • Thanks. This is super easy to understand and get started with basics. – Gopinath May 13 '16 at 19:08
  • Is there a way to do this to create multiple reservoirs of varied sizes from an unbounded stream? That is to be able to iterate over a stream once and sample them into one of multiple reservoirs? Assume each reservoir could have a different size or equal sizes. – CMCDragonkai Apr 13 '18 at 01:21
  • With this implementation, don't the first `n` elements have a higher probability to be picked because there is less of a chance of them being overwritten later on? – Andi Oct 26 '20 at 20:00
22

Following Knuth's (1981) description more closely, Reservoir Sampling (Algorithm R) could be implemented as follows:

import random

def sample(iterable, n):
    """
    Returns @param n random items from @param iterable.
    """
    reservoir = []
    for t, item in enumerate(iterable):
        if t < n:
            reservoir.append(item)
        else:
            m = random.randint(0,t)
            if m < n:
                reservoir[m] = item
    return reservoir
sam
  • 1,406
  • 2
  • 15
  • 25
  • What's the difference between this and [the accepted answer](http://stackoverflow.com/a/2612822/481584)? I think this is exactly the same thing, even if this code is more elegant. – Quentin Pradet May 10 '17 at 13:09
  • 1
    It can be directly related to published research ([Knuth, 1981](https://github.com/djtrack16/thyme/blob/master/computer%20science/Donald.E.Knuth.The.Art.of.Computer.Programming.Volume.2.pdf)), in case someone is interested in more extended explanation or Knuth's proof. – sam Sep 25 '17 at 09:33
  • Where `0 <= random.randint(0,t) <= t` per [random.randint](https://docs.python.org/3/library/random.html#random.randint). – Ahmed Fasih Jan 18 '19 at 03:12
  • @sam Isn't `randint` inclusive? – user76284 Nov 03 '20 at 05:35
  • 1
    @user76284 Correct, and it should be. Let `iterable` contain 11 items from which we want to sample `n`=10. For the 11th item, `t` will be 10 (because `enumerate` starts at 0), and we generate a random integer between 0 and 10 inclusive (i.e., 11 possibilities) such that the probability of adding the 11th item to `reservoir` is n/t = 10/11. – sam Nov 07 '20 at 09:46
2

Java

import java.util.Random;

public static void reservoir(String filename,String[] list)
{
    File f = new File(filename);
    BufferedReader b = new BufferedReader(new FileReader(f));

    String l;
    int c = 0, r;
    Random g = new Random();

    while((l = b.readLine()) != null)
    {
      if (c < list.length)
          r = c++;
      else
          r = g.nextInt(++c);

      if (r < list.length)
          list[r] = l;

      b.close();}
}
abatishchev
  • 98,240
  • 88
  • 296
  • 433
Srba
  • 89
  • 5
0

Python solution

import random

class RESERVOIR_SAMPLING():
    def __init__(self, k=1000):
        self.reservoir = [] 
        self.k = k
        self.nb_processed = 0

    def add_to_reservoir(self, sample):
        self.nb_processed +=1
        if(self.k >= self.nb_processed):
            self.reservoir.append(sample)
        else:
            #randint(a,b) gives a<=int<=b
            j = random.randint(0,self.nb_processed-1)
            if(j < k):
                self.reservoir[j] = sample

k = 10
samples = [i for i in range(10)] * k
res = RESERVOIR_SAMPLING(k)
for sample in samples:
    res.add_to_reservoir(sample)

print(res.reservoir)

out[1]: [9, 8, 4, 8, 3, 5, 1, 7, 0, 9]
Firas Omrane
  • 894
  • 1
  • 14
  • 21