2

I'd like to read more about an algorithm that's used in R for unequal probability sampling, but after a few hours of searching I haven't been able to turn anything up on it. I thought it might have been an Art of Computer Programming algorithm, but I haven't been able to substantiate that either. The particular function in R's random.c is called ProbSampleNoReplace().

Given a vector of probabilities prob[] and a desired sample size n with a vector of selected items ans[]

For each element j in prob[] assign an index perm[j]
Sort the list in order of probability value, largest first

totalmass = 1
For (h=0, n1= n-1, h<nans, h++,n1-- )
    rt = totalmass * rand(in 0:1)
    mass = 0

    **sum the probabilities, largest first, until the sum is bigger than rt**
    for(j=0;j<n1;j++)
        mass += prob[j]
        if rt <= mass then break

    ans[h] = perm[j]
    **reduce size of totalmass to reflect removed item**
    totalmass -= prob[j]

    **reset the indices to be sequential**
    for(k=j, k<n1, k++)
        prob[k] = prob[k+1]
        perm[k] = perm[k+1]
Patrick McCarthy
  • 2,478
  • 2
  • 24
  • 40
  • 2
    What's your question? – alexwhan Mar 17 '13 at 21:52
  • I'd like to find out either who came up with this algorithm, or what the algorithm's name is, if anything. – Patrick McCarthy Mar 17 '13 at 22:20
  • @PatrickMcCarthy, beware that, in the `sample()` function in R, when used with unequal probabilities and without replacement, the inclusion probabilities are NOT the same as the probabilities used in the input vector `prob.` You can test that with the following code: `N <- 11` `n <- 5` `reps <- 100000` `prob <- (1:N)/sum(1:N)` `counts <- rep(0, N)` `for( i in 1:reps )` `{` `s <- sample(N,size=n,replace=FALSE,prob=prob)` `counts[s] <- counts[s] + 1` `}` `print(counts/reps)` `print(n*prob)` – Ferdinand.kraft Mar 18 '13 at 00:46
  • Read the source? https://github.com/wch/r-source/blob/trunk/src/main/random.c#L469 – hadley Mar 18 '13 at 01:35
  • I've read it - in fact I started there, and it's from there that I wrote the above not-so-pseudocode - but I'd like to learn more about what sort of variance estimators are valid, what the corresponding second-order probabilities' properties are, etc., so if it's something that exists I hope that a textbook or academic paper may be able to shed more light on the topic. – Patrick McCarthy Mar 18 '13 at 04:54

1 Answers1

1

The sample function supports unequal probability arguments. Your code fragment is not clear as to its intent to those of us who do not read C.

> table( sample(1:4, 100, repl=TRUE, prob=4:1) )

 1  2  3  4 
46 23 24  7 

There is another SO Q&A that may be useful (found by an SO search with arguments):

random.c ProbSampleNoReplace

Faster weighted sampling without replacement

Community
  • 1
  • 1
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Thanks. I had found that- rather than replacing sample(), I'd like to understand it better, thus the need for author or name so I can search it. – Patrick McCarthy Mar 17 '13 at 22:22
  • I think you need to specify what searching you have already done and what you have found and what answers are not correct. There were at least two citations in that page of answers. Both were not what you wanted? This appears to be more a C question than an R question if you are just trying to find the origin of this implementation. – IRTFM Mar 17 '13 at 22:28
  • How come this is an answer? This answer resembles too much what a comment is :) – Alexander Mar 17 '13 at 23:58
  • It now appears it doesn't answer, but at the time it wasn't clear what kind of searching had been done and it certainly wasn't clear that the cited SO question and answers had been found. – IRTFM Mar 18 '13 at 00:58