
This question led to a new R package: wrswoR

R's default sampling without replacement using sample.int seems to require quadratic run time, e.g. when using weights drawn from a uniform distribution. This is slow for large sample sizes. Does anybody know a faster implementation that would be usable from within R? Two options are "rejection sampling with replacement" (see this question on stats.sx) and the algorithm by Wong and Easton (1980) (with a Python implementation in a StackOverflow answer).

Thanks to Ben Bolker for hinting at the C function that is called internally when sample.int is called with replace=F and non-uniform weights: ProbSampleNoReplace. Indeed, the code shows two nested for loops (line 420 ff of random.c).
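To see where the quadratic cost comes from, here is a rough R sketch of what ProbSampleNoReplace does (an illustration only, not a translation of the C code): each draw scans the remaining weights linearly and then removes the drawn item, so drawing k out of n items costs O(n * k).

# Rough R sketch of the quadratic approach (illustrative only, not the actual C source)
prob_sample_no_replace_sketch <- function(n, k, prob) {
  p <- prob / sum(prob)
  idx <- seq_len(n)
  out <- integer(k)
  for (i in seq_len(k)) {
    u <- runif(1) * sum(p)               # target mass among the remaining items
    j <- 1
    acc <- p[1]
    while (acc < u && j < length(p)) {   # linear scan: O(n) per draw
      j <- j + 1
      acc <- acc + p[j]
    }
    out[i] <- idx[j]
    p <- p[-j]                           # dropping the drawn item is another O(n) step
    idx <- idx[-j]
  }
  out
}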

Here's the code to analyze the run time empirically:

library(plyr)

sample.int.test <- function(n, p) {
    sample.int(2 * n, n, replace=F, prob=p); NULL }

times <- ldply(
  1:7,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n)
    data.frame(
      n=n,
      user=system.time(sample.int.test(n, p), gcFirst=T)['user.self'])
  },
  .progress='text'
)

times

library(ggplot2)
ggplot(times, aes(x=n, y=user/n)) + geom_point() + scale_x_log10() +
  ylab('Time per unit (s)')

# Output:
       n   user
1   2048  0.008
2   4096  0.028
3   8192  0.100
4  16384  0.408
5  32768  1.645
6  65536  6.604
7 131072 26.558

(Plot: time per unit, user/n, against n with a log-scaled x-axis.)
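As a quick sanity check (a sketch using the times data frame from above), the slope of a log-log fit of run time against n should be close to 2 if the run time is indeed quadratic:

# Sketch: slope of the log-log fit; a value near 2 indicates quadratic growth
coef(lm(log(user) ~ log(n), data = times))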

EDIT: Thanks to Arun for pointing out that unweighted sampling doesn't seem to have this performance penalty.
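To double-check this (a quick sketch, not a formal benchmark), the same call without the prob argument should be much faster:

# Sketch: unweighted sampling of the same size, i.e. no prob argument
n <- 131072
system.time(sample.int(2 * n, n, replace = FALSE))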

krlmlr
  • Try moving the `runif(2*n)` part outside of `sample.int.test`? – BenBarnes Feb 27 '13 at 13:56
  • It's the `runif` that's time consuming. – Arun Feb 27 '13 at 13:56
  • @BenBarnes, those are `prob` values for `sample.int`. – Arun Feb 27 '13 at 13:57
  • I don't see why `runif()` would make the run-time *quadratic* ... – Ben Bolker Feb 27 '13 at 13:59
  • The same statement without `runif(2*n)` runs in 0.001 seconds. – Arun Feb 27 '13 at 14:00
  • It seems rather that `sample` with `prob` is time consuming. – Arun Feb 27 '13 at 14:02
  • The algorithm used is viewable at https://github.com/wch/r-source/blob/trunk/src/main/random.c : search for `ProbSampleReplace`. I have no idea if that's useful, but it should give you a ballpark idea of what algorithm is used and whether it can easily be improved. I note that it is sorting the entire vector ... – Ben Bolker Feb 27 '13 at 14:09
  • http://stackoverflow.com/questions/2140787/select-random-k-elements-from-a-list-whose-elements-have-weights , http://stackoverflow.com/questions/352670/weighted-random-selection-with-and-without-replacement ... although of course these are all about algorithms, and you're looking for an *implementation* rather than an *algorithm* – Ben Bolker Feb 27 '13 at 14:12
  • The only other thing I can suggest is that you try `library("sos"); findFn("{sampling without replacement}")` and sort through the results to see if there's anything useful – Ben Bolker Feb 27 '13 at 15:19
  • Not sure if this helps, but it seems that the algorithm of Wong & Easton is implemented in Matlab: http://www.mathworks.se/help/stats/datasample.html – Jouni Helske Mar 02 '13 at 10:10
  • @Hemmo: Thank you, but I'm not sure if I can use MATLAB's implementation for R, since MATLAB is a commercial product... – krlmlr Mar 02 '13 at 17:10
  • A few other links I happened to find (they need a subscription, so I can't check them more closely before Monday): http://link.springer.com/chapter/10.1007%2F978-3-642-30347-0_27 and http://dl.acm.org/citation.cfm?id=1711169. Not sure if there's any real implementation though. – Jouni Helske Mar 02 '13 at 17:25
  • @Hemmo: This is sweet. The second link currently 503-s, but the first seems to answer my other question: http://stackoverflow.com/questions/15114898/sampling-without-replacement-with-unequal-probabilites-linear-run-time-possib – krlmlr Mar 04 '13 at 05:35
  • @krlmlr I know it's bad form to post a 'thanks' comment, but your `wrswoR` package has reduced the time taken to run my code from about 8 hours to <1min! Thank you! – Phil Jun 25 '17 at 21:33

3 Answers


Update:

An Rcpp implementation of the Efraimidis & Spirakis algorithm (thanks to @Hemmo, @Dinrem, @krlmlr and @rtlgrmpf):

library(inline)
library(Rcpp)
src <- 
'
// number of items to draw (num) and population size (x)
int num = as<int>(size), x = as<int>(n);
Rcpp::NumericVector vx = Rcpp::clone<Rcpp::NumericVector>(x);
Rcpp::NumericVector pr = Rcpp::clone<Rcpp::NumericVector>(prob);
// exponential keys Exp(1) / weight; the smallest keys correspond to the
// largest runif()^(1/weight) keys of Efraimidis & Spirakis
Rcpp::NumericVector rnd = rexp(x) / pr;
// fill vx with the 0-based indices 0, ..., x - 1
for(int i= 0; i<vx.size(); ++i) vx[i] = i;
// move the num indices with the smallest keys to the front
std::partial_sort(vx.begin(), vx.begin() + num, vx.end(), Comp(rnd));
// keep the first num indices and convert to 1-based R indices
vx = vx[seq(0, num - 1)] + 1;
return vx;
'
incl <- 
'
struct Comp{
  Comp(const Rcpp::NumericVector& v ) : _v(v) {}
  bool operator ()(int a, int b) { return _v[a] < _v[b]; }
  const Rcpp::NumericVector& _v;
};
'
funFast <- cxxfunction(signature(n = "numeric", size = "integer", prob = "numeric"),
                       src, plugin = "Rcpp", include = incl)

# See the bottom of the answer for comparison
p <- c(995/1000, rep(1/1000, 5))
n <- 100000
system.time(print(table(replicate(funFast(6, 3, p), n = n)) / n))

      1       2       3       4       5       6 
1.00000 0.39996 0.39969 0.39973 0.40180 0.39882 
   user  system elapsed 
   3.93    0.00    3.96 
# In case of:
# Rcpp::IntegerVector vx = Rcpp::clone<Rcpp::IntegerVector>(x);
# i.e. instead of NumericVector
      1       2       3       4       5       6 
1.00000 0.40150 0.39888 0.39925 0.40057 0.39980 
   user  system elapsed 
   1.93    0.00    2.03 

Old version:

Let us try a few possible approaches:

Simple rejection sampling with replacement. This is a far simpler function than the sample.int.rej offered by @krlmlr, i.e. the sample size is always equal to n. As we will see, it is still really fast assuming a uniform distribution for the weights, but extremely slow in the other situation.

fastSampleReject <- function(all, n, w){
  out <- numeric(0)
  while(length(out) < n)
    out <- unique(c(out, sample(all, n, replace = TRUE, prob = w)))
  out[1:n]
}

The algorithm by Wong and Easton (1980). Here is an R implementation of this Python version. It is stable and I might be missing something, but it is much slower than the other functions.

fastSample1980 <- function(all, n, w){
  tws <- w
  for(i in (length(tws) - 1):0)
    tws[1 + i] <- sum(tws[1 + i], tws[1 + 2 * i + 1], 
                      tws[1 + 2 * i + 2], na.rm = TRUE)      
  out <- numeric(n)
  for(i in 1:n){
    gas <- tws[1] * runif(1)
    k <- 0        
    while(gas > w[1 + k]){
      gas <- gas - w[1 + k]
      k <- 2 * k + 1
      if(gas > tws[1 + k]){
        gas <- gas - tws[1 + k]
        k <- k + 1
      }
    }
    wgh <- w[1 + k]
    out[i] <- all[1 + k]        
    w[1 + k] <- 0
    while(1 + k >= 1){
      tws[1 + k] <- tws[1 + k] - wgh
      k <- floor((k - 1) / 2)
    }
  }
  out
}

Rcpp implementation of the algorithm by Wong and Easton. It could possibly be optimized even more, since this is my first usable Rcpp function, but it works well anyway.

library(inline)
library(Rcpp)

src <-
'
Rcpp::NumericVector weights = Rcpp::clone<Rcpp::NumericVector>(w);
Rcpp::NumericVector tws = Rcpp::clone<Rcpp::NumericVector>(w);
Rcpp::NumericVector x = Rcpp::NumericVector(all);
int k, num = as<int>(n);
Rcpp::NumericVector out(num);
double gas, wgh;

if((weights.size() - 1) % 2 == 0){
  tws[((weights.size()-1)/2)] += tws[weights.size()-1] + tws[weights.size()-2];
}
else
{
  tws[floor((weights.size() - 1)/2)] += tws[weights.size() - 1];
}

for (int i = (floor((weights.size() - 1)/2) - 1); i >= 0; i--){
  tws[i] += (tws[2 * i + 1]) + (tws[2 * i + 2]);
}
for(int i = 0; i < num; i++){
  gas = as<double>(runif(1)) * tws[0];
  k = 0;
  while(gas > weights[k]){
    gas -= weights[k];
    k = 2 * k + 1;
    if(gas > tws[k]){
      gas -= tws[k];
      k += 1;
    }
  }
  wgh = weights[k];
  out[i] = x[k];
  weights[k] = 0;
  while(k > 0){
    tws[k] -= wgh;
    k = floor((k - 1) / 2);
  }
  tws[0] -= wgh;
}
return out;
'

fun <- cxxfunction(signature(all = "numeric", n = "integer", w = "numeric"),
                   src, plugin = "Rcpp")

Now some results:

times1 <- ldply(
  1:6,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n) # Uniform distribution
    p <- p/sum(p)
    data.frame(
      n=n,
      user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],
             system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],
             system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],
             system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),
      id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))
  },
  .progress='text'
)


times2 <- ldply(
  1:6,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n - 1)
    p <- p/sum(p) 
    p <- c(0.999, 0.001 * p) # Special case
    data.frame(
      n=n,
      user=c(system.time(sample.int.test(n, p), gcFirst=T)['user.self'],
             system.time(weighted_Random_Sample(1:(2*n), p, n), gcFirst=T)['user.self'],
             system.time(fun(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(sample.int.rej(2*n, n, p), gcFirst=T)['user.self'],
             system.time(fastSampleReject(1:(2*n), n, p), gcFirst=T)['user.self'],
             system.time(fastSample1980(1:(2*n), n, p), gcFirst=T)['user.self']),
      id=c("Base", "Reservoir", "Rcpp", "Rejection", "Rejection simple", "1980"))
  },
  .progress='text'
)

(Plots: benchmark run times for the uniform-weight case, times1, and for the skewed special case, times2; the corresponding timings are tabulated below.)

arrange(times1, id)
       n  user               id
1   2048  0.53             1980
2   4096  0.94             1980
3   8192  2.00             1980
4  16384  4.32             1980
5  32768  9.10             1980
6  65536 21.32             1980
7   2048  0.02             Base
8   4096  0.05             Base
9   8192  0.18             Base
10 16384  0.75             Base
11 32768  2.99             Base
12 65536 12.23             Base
13  2048  0.00             Rcpp
14  4096  0.01             Rcpp
15  8192  0.03             Rcpp
16 16384  0.07             Rcpp
17 32768  0.14             Rcpp
18 65536  0.31             Rcpp
19  2048  0.00        Rejection
20  4096  0.00        Rejection
21  8192  0.00        Rejection
22 16384  0.02        Rejection
23 32768  0.02        Rejection
24 65536  0.03        Rejection
25  2048  0.00 Rejection simple
26  4096  0.01 Rejection simple
27  8192  0.00 Rejection simple
28 16384  0.01 Rejection simple
29 32768  0.00 Rejection simple
30 65536  0.05 Rejection simple
31  2048  0.00        Reservoir
32  4096  0.00        Reservoir
33  8192  0.00        Reservoir
34 16384  0.02        Reservoir
35 32768  0.03        Reservoir
36 65536  0.05        Reservoir

arrange(times2, id)
       n  user               id
1   2048  0.43             1980
2   4096  0.93             1980
3   8192  2.00             1980
4  16384  4.36             1980
5  32768  9.08             1980
6  65536 19.34             1980
7   2048  0.01             Base
8   4096  0.04             Base
9   8192  0.18             Base
10 16384  0.75             Base
11 32768  3.11             Base
12 65536 12.04             Base
13  2048  0.01             Rcpp
14  4096  0.02             Rcpp
15  8192  0.03             Rcpp
16 16384  0.08             Rcpp
17 32768  0.15             Rcpp
18 65536  0.33             Rcpp
19  2048  0.00        Rejection
20  4096  0.00        Rejection
21  8192  0.02        Rejection
22 16384  0.02        Rejection
23 32768  0.05        Rejection
24 65536  0.08        Rejection
25  2048  1.43 Rejection simple
26  4096  2.87 Rejection simple
27  8192  6.17 Rejection simple
28 16384 13.68 Rejection simple
29 32768 29.74 Rejection simple
30 65536 73.32 Rejection simple
31  2048  0.00        Reservoir
32  4096  0.00        Reservoir
33  8192  0.02        Reservoir
34 16384  0.02        Reservoir
35 32768  0.02        Reservoir
36 65536  0.04        Reservoir

Obviously we can reject the 1980 function because it is slower than Base in both cases. Rejection simple also gets into trouble in the second case, where one probability is 0.999.

So we are left with Rejection, Rcpp and Reservoir. The last step is checking whether the values themselves are correct. To be sure about them, we will use sample as a benchmark (and also to eliminate any confusion about the empirical frequencies, which need not coincide with p because the sampling is without replacement).

p <- c(995/1000, rep(1/1000, 5))
n <- 100000

system.time(print(table(replicate(sample(1:6, 3, repl = FALSE, prob = p), n = n))/n))
      1       2       3       4       5       6 
1.00000 0.39992 0.39886 0.40088 0.39711 0.40323  # Benchmark
   user  system elapsed 
   1.90    0.00    2.03 

system.time(print(table(replicate(sample.int.rej(2*3, 3, p), n = n))/n))
      1       2       3       4       5       6 
1.00000 0.40007 0.40099 0.39962 0.40153 0.39779 
   user  system elapsed 
  76.02    0.03   77.49 # Slow

system.time(print(table(replicate(weighted_Random_Sample(1:6, p, 3), n = n))/n))
      1       2       3       4       5       6 
1.00000 0.49535 0.41484 0.36432 0.36338 0.36211  # Incorrect
   user  system elapsed 
   3.64    0.01    3.67 

system.time(print(table(replicate(fun(1:6, 3, p), n = n))/n))
      1       2       3       4       5       6 
1.00000 0.39876 0.40031 0.40219 0.40039 0.39835 
   user  system elapsed 
   4.41    0.02    4.47 

Notice a few things here. For some reason weighted_Random_Sample returns incorrect values (I have not looked into it at all, but it works correctly assuming a uniform distribution). sample.int.rej is very slow in repeated sampling.

In conclusion, it seems that Rcpp is the optimal choice for repeated sampling, while sample.int.rej is a bit faster otherwise and also easier to use.

Julius Vainora
  • Very nice, especially the code that tests the samplers! It might be that `weighted_Random_Sample` suffers from the limited precision of IEEE floating-point values. Unfortunately, Dinre hasn't replied yet to my request concerning that topic, but see [this related question](http://math.stackexchange.com/q/320761/16420) on math.sx :-) – krlmlr Mar 04 '13 at 21:02
  • Please check `sample.int.rank` in my package, file [sample_int_rank.R](https://github.com/muelleki/kimisc/blob/master/R/sample_int_rank.R). This function is an implementation of reservoir sampling that passes your distribution test. – krlmlr Mar 05 '13 at 01:24
  • @krlmlr, I see, now it is a valid winner in the algorithms elegance contest :) However, it is slow in repeated sampling. I will update a bit my answer later today. – Julius Vainora Mar 05 '13 at 06:54
  • Don't bother yet, `sort(rexp(n))` is already slow. Let's wait for an Rcpp solution, shouldn't be too difficult. – krlmlr Mar 05 '13 at 08:04
  • I have set up a package just for sampling algorithms: https://github.com/muelleki/wrswoR . Please feel free to fork and add your Wong-Easton implementation, and also the test code. – krlmlr Mar 05 '13 at 08:53
  • Just to throw an extra wrench in the process, on my machine, I got almost identical results between the benchmark `1.0000 0.4017 0.4002 0.4091 0.3917 0.3973` and weighted_Random_Sample `1.0000 0.3991 0.4056 0.4012 0.4014 0.3927`. I was using `p <- c(955, rep(1, 5))`, though, since that is actually a more reliable method for sampling in general. All of the major sampling functions accept weightings and do not assume `sum(prob) == 1`, so I prefer to use weightings in all my work. This may limit how the WRS algorithm can be used, though. – Dinre Mar 05 '13 at 12:49
  • @Dinre: This shows that your implementation depends on the magnitude of the weights -- using `-rexp(n) / prob = log(runif(n)) / prob` is numerically more robust than `runif(n) ^ (1 / prob)`, and equivalent w.r.t. sort order. – krlmlr Mar 05 '13 at 12:56
  • @krlmlr, I only did a direct translation of the algorithm presented in the Egraimidis & Spirakis paper as a proof-of-concept. It works. Shoring it up into a robust function will need more than just numerical work; it currently doesn't check for validity in the arguments, either. In its current form, the function is just a proof that the algorithm can be used successfully, which is the first step to proving an algorithm exists in practical space. The paper only proved that the algorithm existed in theoretical space. – Dinre Mar 05 '13 at 13:05
  • @Julius: Thanks for the `Rcpp` code. I have incorporated a modified version into the `wrswoR` package. – krlmlr Mar 06 '13 at 09:47

I decided to dig down into some of the comments and found the Efraimidis & Spirakis paper to be fascinating (thanks to @Hemmo for finding the reference). The general idea in the paper is this: create a key by generating a random uniform number and raising it to the power of one over the weight for each item. Then, you simply take the highest key values as your sample. This works out brilliantly!

weighted_Random_Sample <- function(
    .data,
    .weights,
    .n
    ){

    key <- runif(length(.data)) ^ (1 / .weights)
    return(.data[order(key, decreasing=TRUE)][1:.n])
}

If you set '.n' to be the length of '.data' (which should always be the length of '.weights'), this is actually a weighted reservoir permutation, but the method works well for both sampling and permutation.

Update: I should probably mention that the above function expects the weights to be greater than zero. Otherwise key <- runif(length(.data)) ^ (1 / .weights) won't be ordered properly.
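For illustration, a small usage sketch (the data and weights here are made up):

set.seed(42)
x <- letters[1:5]
w <- c(5, 4, 3, 2, 1)                    # strictly positive weights
weighted_Random_Sample(x, w, 2)          # weighted sample of size 2 without replacement
weighted_Random_Sample(x, w, length(x))  # full weighted permutation (.n equals length(.data))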


Just for kicks, I also used the test scenario in the OP to compare both functions.

set.seed(1)

times_WRS <- ldply(
1:7,
function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n)
    n_Set <- 1:(2 * n)
    data.frame(
      n=n,
      user=system.time(weighted_Random_Sample(n_Set, p, n), gcFirst=T)['user.self'])
  },
  .progress='text'
)

sample.int.test <- function(n, p) {
sample.int(2 * n, n, replace=F, prob=p); NULL }

times_sample.int <- ldply(
  1:7,
  function(i) {
    n <- 1024 * (2 ** i)
    p <- runif(2 * n)
    data.frame(
      n=n,
      user=system.time(sample.int.test(n, p), gcFirst=T)['user.self'])
  },
  .progress='text'
)

times_WRS$group <- "WRS"
times_sample.int$group <- "sample.int"
library(ggplot2)

ggplot(rbind(times_WRS, times_sample.int) , aes(x=n, y=user/n, col=group)) + geom_point() + scale_x_log10() +  ylab('Time per unit (s)')

And here are the times:

times_WRS
#        n user
# 1   2048 0.00
# 2   4096 0.01
# 3   8192 0.00
# 4  16384 0.01
# 5  32768 0.03
# 6  65536 0.06
# 7 131072 0.16

times_sample.int
#        n  user
# 1   2048  0.02
# 2   4096  0.05
# 3   8192  0.14
# 4  16384  0.58
# 5  32768  2.33
# 6  65536  9.23
# 7 131072 37.79

(Plot: time per unit for weighted_Random_Sample vs. sample.int; see the timings above.)

Dinre
  • Is this really the same as weighted sampling without replacement, or just an approximation? I have asked a question on this on [math.sx](http://math.stackexchange.com/questions/316175/probability-to-choose-specific-item-in-a-weighted-sampling-without-replacement) but got no answer... – krlmlr Mar 04 '13 at 15:39
  • Well, all forms of random sampling are approximations, so I suppose the answer to both questions is "yes." More specifically, though, this method does meet all of the sampling criteria that I am familiar with, since in this case a weighting of 2 vs 1 is twice as likely to be chosen as the first element in the sequence. It is important to note that in a non-replacement scenario, the only question is *where* an item will be placed and not *if*. – Dinre Mar 04 '13 at 15:50
  • ...unless you want to sample only part of the population...? – krlmlr Mar 04 '13 at 15:51
  • No. It's the same thing, since you're not using replacement. You still order the whole set, but you select the 1:n number of items from the ordered set. – Dinre Mar 04 '13 at 15:54
  • I suppose I should amend my last comment: It's the same thing in *this* algorithm, because the entire set is always ordered. That's why you don't have the need for loops to evaluate the validity of each item selection. – Dinre Mar 04 '13 at 16:09
  • By the way, your question on Math.SE can be answered in the Efraimidis & Spirakis paper. The proof is done there on pg 182/183 (pg 2/3 of the article). You should probably contact the authors and request a copy of the article if you don't otherwise have access. They'll likely be happy (read: thrilled) to accommodate, and many of your questions could be answered with a quick read. – Dinre Mar 04 '13 at 16:25
  • The algorithm by Pavlos S. Efraimidis and Paul G. Spirakis is by far the most beautiful thing I've seen for a long time, just for its simplicity. It's just as sweet as implementing convolution through FFT, not sure which wins though... NB: The authors prove that their algorithm is equivalent to weighted random sampling without replacement. – krlmlr Mar 04 '13 at 19:11
  • This looks great, glad that my link helped. – Jouni Helske Mar 04 '13 at 19:18
  • I have added a variant with improved numerical stability to my [kimisc](https://github.com/muelleki/kimisc) package, function `sample.int.rank`, file [sample_int_rank.R](https://github.com/muelleki/kimisc/blob/master/R/sample_int_rank.R). Please let me know if and how you want to be included as author. An Rcpp version is still missing :-) – krlmlr Mar 05 '13 at 01:23
  • I have moved the method to a new package just for sampling algorithms: github.com/muelleki/wrswoR, there's also a stub for an Rcpp-based implementation of `sample.int.rank` (in `sample_int_crank.cpp`). Do you want to supply an Rcpp version of the algorithm? – krlmlr Mar 05 '13 at 09:07
  • @krlmlr, Regarding floating point errors for the key calculation, it's irrelevant, because FP errors are essentially a small bit of stochastic noise. This does not impact a "randomized" process, since the additional noise actually serves our purpose. None of the data values are acted on, so they remain stable. If you're doing FP weights that experience sensitivity to FP error by themselves, you're probably using the wrong weights and could substitute with integers. – Dinre Mar 05 '13 at 12:36
  • @krlmlr, I don't particularly need authorship credit. I only coded up a literal translation of the algorithm in the paper. The paper should probably be cited, though. If someone else comes along and wants to add error-checking for all the arguments, that is probably worth an authorship note. – Dinre Mar 05 '13 at 12:37
  • @Dinre: The floating point error is very relevant, see the end of Julius' answer: http://stackoverflow.com/a/15210797/946850 . It doesn't happen with my implementation in the wrswoR package on Github. – krlmlr Mar 05 '13 at 12:49
  • @krlmlr, I see what you're talking about. It doesn't impact the performance of this algorithm in terms of its ability to use weightings. It will likely cause problems where you have high-precision probabilities where `sum(prob)==1`, though, since the FP math will start using too many decimal places. In general, though, if you want accurate math for weightings, it's best to use weightings, rather than probabilities. Then, the use of integers is not only appropriate, but also more accurate and reliable. – Dinre Mar 05 '13 at 12:53
  • I agree with @krlmr: this is a simply beautiful idea. Thanks, Dinre, for highlighting it! – Josh O'Brien Mar 05 '13 at 18:36

Let me throw in my own implementation of a faster approach based on rejection sampling with replacement. The idea is this:

  • Generate a sample with replacement that is "somewhat" larger than the requested size

  • Throw away the duplicate values

  • If not enough values have been drawn, call the same procedure recursively with adjusted n, size and prob parameters

  • Remap the returned indexes to the original indexes

How big a sample do we need to draw? Assuming a uniform distribution, the expected number of trials needed to see size unique values out of n total values is n * (H_n - H_{n - size}), i.e. n times the difference of two harmonic numbers. The first few harmonic numbers are tabulated; otherwise an approximation using the natural logarithm is used. (This is only a ballpark figure, no need to be too precise here.) Now, for a non-uniform distribution, the expected number of items to be drawn can only be larger, so we won't be drawing too many samples. In addition, the number of samples drawn is limited by twice the population size -- I assume that it's faster to have a few recursive calls than sampling up to O(n ln n) items.
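Here is a minimal R sketch of the idea (simplified: it always oversamples by a fixed factor instead of using the harmonic-number estimate; the full implementation is the sample.int.rej routine mentioned below):

# Minimal sketch of the recursive rejection approach (simplified oversampling)
sample_int_rej_sketch <- function(n, size, prob) {
  # draw with replacement, somewhat more than requested, and keep the unique indexes
  m <- min(2L * size, 2L * n)
  u <- unique(sample.int(n, m, replace = TRUE, prob = prob))
  picked <- u[seq_len(min(size, length(u)))]
  if (length(picked) < size) {
    # not enough unique values yet: recurse with adjusted n, size and prob,
    # then remap the recursive result to the original indexes
    remaining <- setdiff(seq_len(n), picked)
    more <- sample_int_rej_sketch(length(remaining), size - length(picked),
                                  prob[remaining])
    picked <- c(picked, remaining[more])
  }
  picked
}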

The full implementation is available in the R package wrswoR, as the sample.int.rej routine in sample_int_rej.R. Install with:

library(devtools)
install_github('wrswoR', 'muelleki')

It seems to work "fast enough"; however, no formal run-time tests have been carried out yet. Also, the package has been tested on Ubuntu only. I appreciate your feedback.

krlmlr
  • Yes, assuming a uniform distribution it is fast, but assuming not so convenient case it gets quite bad. Today I will post an answer about it, R implementation of [this](http://stackoverflow.com/a/2149533/946850) and I will try to finish its `Rcpp` version. – Julius Vainora Mar 04 '13 at 15:24
  • @Julius: Looking forward to the benchmark :-) I have tested my code with exponential distribution of weights (which is the worst you can get with IEEE floats), expecting really horrible behavior, but to my surprise it was not that bad... – krlmlr Mar 04 '13 at 15:36
  • @krlmlr, the way you describe the algorithm, the inclusion probabilities will *not* be proportional to the weights used. – Ferdinand.kraft Aug 23 '13 at 01:12
  • @Ferdinand.kraft: O'rly? Care to elaborate? – krlmlr Aug 23 '13 at 07:03
  • @krlmlr in rejection sampling, you must discard *the entire sample* if there are duplicates in the with-replacement step. And even then the inclusion probabilities relate to the weights in a complex, non-linear fashion. In design-based inference, we need the final inclusion probabilities, not the weights used inside the algorithm. So my point is: how to calculate them given the weights and size of a sample drawn using the method you propose? – Ferdinand.kraft Aug 23 '13 at 12:19
  • @Ferdinand.kraft: The [answer by whuber](http://stats.stackexchange.com/a/20592/6432) contains some detail on why it is allowed to sample element by element using the relative weights and discarding each duplicate element found. Now, in this process, sampling element by element is equivalent to sampling a bulk and then discarding the duplicates. If you are still in doubt, you can compare results for the implementation in my wrswoR package to those of sample.int or even to that of an implementation of the Efraimidis&Spirakis algorithm. Let me know if you find any meaningful difference. – krlmlr Aug 23 '13 at 13:01