16

I am trying to test the likelihood that a particular clustering of data has occurred by chance. A robust way to do this is Monte Carlo simulation, in which the associations between data and groups are randomly reassigned a large number of times (e.g. 10,000), and a metric of clustering is used to compare the actual data with the simulations to determine a p value.

I've got most of this working, with pointers mapping the grouping to the data elements, so I plan to randomly reassign pointers to data. THE QUESTION: what is a fast way to sample without replacement, so that every pointer is randomly reassigned in the replicate data sets?

For example (these data are just a simplified example):

Data (n=12 values) - Group A: 0.1, 0.2, 0.4 / Group B: 0.5, 0.6, 0.8 / Group C: 0.4, 0.5 / Group D: 0.2, 0.2, 0.3, 0.5

For each replicate data set, I would have the same cluster sizes (A=3, B=3, C=2, D=4) and data values, but would reassign the values to the clusters.

To do this, I could generate random numbers in the range 1-12, assign the first element of group A, then generate random numbers in the range 1-11 and assign the second element in group A, and so on. The pointer reassignment is fast, and I will have pre-allocated all data structures, but the sampling without replacement seems like a problem that might have been solved many times before.

Logic or pseudocode preferred.

Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
Argalatyr
  • 4,639
  • 3
  • 36
  • 62

7 Answers7

41

Here's some code for sampling without replacement based on Algorithm 3.4.2S of Knuth's book Seminumeric Algorithms.

void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far
    double u;

    while (m < n)
    {
        u = GetUniform(); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

There is a more efficient but more complex method by Jeffrey Scott Vitter in "An Efficient Algorithm for Sequential Random Sampling," ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.

John D. Cook
  • 29,517
  • 10
  • 67
  • 94
  • I don't have this book (yet), and had trouble proving the correctness of the algorithm to myself. I implemented it in java and checked the population items are sampled with uniform probability. Results are convincing. See this [gist](https://gist.github.com/10864989) – Alban Apr 16 '14 at 12:28
  • An uncritical implementation of Vitter's Method D in Mathematica is orders of magnitude faster than the built-in algorithm. I describe it over here: http://tinyurl.com/lbldlpq – Reb.Cabin Sep 06 '14 at 23:46
  • 4
    @Alban - We can view the problem of sampling n elements from a population of N by considering the first element. There is a (n/N) probability that this element is included: if it is, then the problem reduces to sampling (n-1) elements out of (N-1) remaining; if not, then the problem reduces to sampling (n) elements out of (N-1) remaining. Some variable transformation will show that this is the essence of Knuth's algorithm (by incrementing t). – Hao Ye Nov 07 '14 at 23:23
  • Does it matter if `u` is in the open, half-open or closed interval, `(0, 1)`, `[0, 1)` or `[0, 1]`? Knuth just says "uniformly distributed between zero and one." – ZachB Aug 23 '21 at 20:08
  • @ZachB mathematically those intervals are the same [almost surely](https://en.wikipedia.org/wiki/Almost_surely). In application the probability of a random double between 0 and 1 being exactly 0 is something like 2^-53. – qwr Sep 15 '22 at 07:31
8

A C++ working code based on the answer by John D. Cook.

#include <random>
#include <vector>

// John D. Cook, https://stackoverflow.com/a/311716/15485
void SampleWithoutReplacement
(
    int populationSize,    // size of set sampling from
    int sampleSize,        // size of each sample
    std::vector<int> & samples  // output, zero-offset indicies to selected items
)
{
    // Use Knuth's variable names
    int& n = sampleSize;
    int& N = populationSize;

    int t = 0; // total input records dealt with
    int m = 0; // number of items selected so far

    std::default_random_engine re;
    std::uniform_real_distribution<double> dist(0,1);

    while (m < n)
    {
        double u = dist(re); // call a uniform(0,1) random number generator

        if ( (N - t)*u >= n - m )
        {
            t++;
        }
        else
        {
            samples[m] = t;
            t++; m++;
        }
    }
}

#include <iostream>
int main(int,char**)
{
  const size_t sz = 10;
  std::vector< int > samples(sz);
  SampleWithoutReplacement(10*sz,sz,samples);
  for (size_t i = 0; i < sz; i++ ) {
    std::cout << samples[i] << "\t";
  }

  return 0;
}
ZachB
  • 13,051
  • 4
  • 61
  • 89
Alessandro Jacopson
  • 18,047
  • 15
  • 98
  • 153
  • 1
    I edited your answer so it wouldn't be absurdly slow due to thread guards in GCC and other common compilers. Per my [comment on John's answer](https://stackoverflow.com/questions/311703/algorithm-for-sampling-without-replacement/311716#comment121764559_311716), I don't know if the interval should be open, half-open or closed, however. This is currently half-open. – ZachB Aug 23 '21 at 20:12
6

See my answer to this question Unique (non-repeating) random numbers in O(1)?. The same logic should accomplish what you are looking to do.

Community
  • 1
  • 1
Robert Gamble
  • 106,424
  • 25
  • 145
  • 137
  • Excellent! Sorry I did not see that answer when I searched SO (for sampling without replacement, statistics, algorithms, etc). Maybe this will serve like a meta-question to lead people like me to your original answer. Cheers! – Argalatyr Nov 22 '08 at 20:07
2

Inspired by @John D. Cook's answer, I wrote an implementation in Nim. At first I had difficulties understanding how it works, so I commented extensively also including an example. Maybe it helps to understand the idea. Also, I have changed the variable names slightly.

iterator uniqueRandomValuesBelow*(N, M: int) =
  ## Returns a total of M unique random values i with 0 <= i < N
  ## These indices can be used to construct e.g. a random sample without replacement
  assert(M <= N)

  var t = 0 # total input records dealt with
  var m = 0 # number of items selected so far

  while (m < M):
    let u = random(1.0) # call a uniform(0,1) random number generator

    # meaning of the following terms:
    # (N - t) is the total number of remaining draws left (initially just N)
    # (M - m) is the number how many of these remaining draw must be positive (initially just M)
    # => Probability for next draw = (M-m) / (N-t)
    #    i.e.: (required positive draws left) / (total draw left)
    #
    # This is implemented by the inequality expression below:
    # - the larger (M-m), the larger the probability of a positive draw
    # - for (N-t) == (M-m), the term on the left is always smaller => we will draw 100%
    # - for (N-t) >> (M-m), we must get a very small u
    #
    # example: (N-t) = 7, (M-m) = 5
    # => we draw the next with prob 5/7
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 6
    # => we draw the next with prob 5/6
    #    lets assume the draw succeeds
    # => t += 1, m += 1 => (N-t) = 5, (M-m) = 4
    # => we draw the next with prob 4/5
    #    lets assume the draw fails
    # => t += 1 => (N-t) = 4
    # => we draw the next with prob 4/4, i.e.,
    #    we will draw with certainty from now on
    #    (in the next steps we get prob 3/3, 2/2, ...)
    if (N - t)*u >= (M - m).toFloat: # this is essentially a draw with P = (M-m) / (N-t)
      # no draw -- happens mainly for (N-t) >> (M-m) and/or high u
      t += 1
    else:
      # draw t -- happens when (M-m) gets large and/or low u
      yield t # this is where we output an index, can be used to sample
      t += 1
      m += 1

# example use
for i in uniqueRandomValuesBelow(100, 5):
  echo i
Community
  • 1
  • 1
bluenote10
  • 23,414
  • 14
  • 122
  • 178
2

When the population size is much greater than the sample size, the above algorithms become inefficient, since they have complexity O(n), n being the population size.

When I was a student I wrote some algorithms for uniform sampling without replacement, which have average complexity O(s log s), where s is the sample size. Here is the code for the binary tree algorithm, with average complexity O(s log s), in R:

# The Tree growing algorithm for uniform sampling without replacement
# by Pavel Ruzankin 
quicksample = function (n,size)
# n - the number of items to choose from
# size - the sample size
{
  s=as.integer(size)
  if (s>n) {
    stop("Sample size is greater than the number of items to choose from")
  }
  # upv=integer(s) #level up edge is pointing to
  leftv=integer(s) #left edge is poiting to; must be filled with zeros
  rightv=integer(s) #right edge is pointig to; must be filled with zeros
  samp=integer(s) #the sample
  ordn=integer(s) #relative ordinal number

  ordn[1L]=1L #initial value for the root vertex
  samp[1L]=sample(n,1L) 
  if (s > 1L) for (j in 2L:s) {
    curn=sample(n-j+1L,1L) #current number sampled
    curordn=0L #currend ordinal number
    v=1L #current vertice
    from=1L #how have come here: 0 - by left edge, 1 - by right edge
    repeat {
      curordn=curordn+ordn[v]
      if (curn+curordn>samp[v]) { #going down by the right edge
        if (from == 0L) {
          ordn[v]=ordn[v]-1L
        }
        if (rightv[v]!=0L) {
          v=rightv[v]
          from=1L
        } else { #creating a new vertex
          samp[j]=curn+curordn
          ordn[j]=1L
          # upv[j]=v
          rightv[v]=j
          break
        }
      } else { #going down by the left edge
        if (from==1L) {
          ordn[v]=ordn[v]+1L
        }
        if (leftv[v]!=0L) {
          v=leftv[v]
          from=0L
        } else { #creating a new vertex
          samp[j]=curn+curordn-1L
          ordn[j]=-1L
          # upv[j]=v
          leftv[v]=j
          break
        }
      }
    }
  }
  return(samp)  
}

The complexity of this algorithm is discussed in: Rouzankin, P. S.; Voytishek, A. V. On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 (1999), no. 1, 39-54. http://dx.doi.org/10.1515/mcma.1999.5.1.39

If you find the algorithm useful, please make a reference.

See also: P. Gupta, G. P. Bhattacharjee. (1984) An efficient algorithm for random sampling without replacement. International Journal of Computer Mathematics 16:4, pages 201-209. DOI: 10.1080/00207168408803438

Teuhola, J. and Nevalainen, O. 1982. Two efficient algorithms for random sampling without replacement. /IJCM/, 11(2): 127–140. DOI: 10.1080/00207168208803304

In the last paper the authors use hash tables and claim that their algorithms have O(s) complexity. There is one more fast hash table algorithm, which will soon be implemented in pqR (pretty quick R): https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

1

I wrote a survey of algorithms for sampling without replacement. I may be biased but I recommend my own algorithm, implemented in C++ below, as providing the best performance for many k, n values and acceptable performance for others. randbelow(i) is assumed to return a fairly chosen random non-negative integer less than i.

void cardchoose(uint32_t n, uint32_t k, uint32_t* result) {
    auto t = n - k + 1;
    for (uint32_t i = 0; i < k; i++) {
        uint32_t r = randbelow(t + i);
        if (r < t) {
            result[i] = r;
        } else {
            result[i] = result[r - t];
        }
    }
    std::sort(result, result + k);
    for (uint32_t i = 0; i < k; i++) {
        result[i] += i;
    }
}
Paul Crowley
  • 1,656
  • 1
  • 14
  • 26
  • How does it compare to std::sample and ranges::sample? – Surt Aug 24 '21 at 00:02
  • This would depend on how your particular C++ stdlib implements it. In both cases, the docs say "This function may implement selection sampling or reservoir sampling" so perhaps it will perform similarly to my implementation of one of these algorithms, but you'd have to test for yourself to be sure. – Paul Crowley Aug 25 '21 at 04:04
0

Another algorithm for sampling without replacement is described here.

It is similar to the one described by John D. Cook in his answer and also from Knuth, but it has different hypothesis: The population size is unknown, but the sample can fit in memory. This one is called "Knuth's algorithm S".

Quoting the rosettacode article:

  1. Select the first n items as the sample as they become available;
  2. For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
  3. Repeat #2 for any subsequent items.
Argalatyr
  • 4,639
  • 3
  • 36
  • 62
Alban
  • 1,915
  • 1
  • 14
  • 17
  • 1
    Rosettacode has the wrong name for the algorithm: it should be "Algorithm R" or "Reservoir Sampling". "Algorithm S" (aka "Selection Sampling Technique") requires to know in advance the population size. Both algorithms are described in TAOCP - Vol 2 - §3.4.2 – manlio Feb 01 '16 at 10:39