Why are two random deviates needed to ensure uniform sampling of large integers with sample()?

Question

Given that the following are equivalent we might infer that R uses the same C runif function to generate uniform samples for sample() and runif()...

set.seed(1)
sample(1000,10,replace=TRUE)
#[1] 27 38 58 91 21 90 95 67 63  7

set.seed(1)
ceiling( runif(10) * 1000 )
#[1] 27 38 58 91 21 90 95 67 63  7

However they are not equivalent when dealing with big numbers ( n > 2^32 - 1 ):

set.seed(1)
ceiling( runif(1e1) * as.numeric(10^12) )
#[1] 265508663143 372123899637 572853363352 908207789995 201681931038 898389684968
#[7] 944675268606 660797792487 629114043899  61786270468

set.seed(1)
sample( as.numeric(10^12) , 1e1 , replace = TRUE )
#[1] 2655086629 5728533837 2016819388 9446752865 6291140337 2059745544 6870228465
#[8] 7698414177 7176185248 3800351852

Update

As @Arun points out the 1st, 3rd, 5th, ... of runif() approximate result of 1st, 2nd, 3rd... from sample().

It turns out that both functions call unif_rand() behind the scenes, however, sample, given an argument, n that is larger than the largest representable integer of type "integer" but representable as an integer as type "numeric" uses this static definition to draw a random deviate (as opposed to just unif_rand() as in the case of runif())...

static R_INLINE double ru()
   {
       double U = 33554432.0;
       return (floor(U*unif_rand()) + unif_rand())/U;
   }

With a cryptic note in the docs that...

Two random numbers are used to ensure uniform sampling of large integers.

Why are two random numbers needed to ensure uniform sampling of large integers?
What is the constant U for and why does it take the particular value 33554432.0?

What does "correct" mean in this context? These are supposed to be random (of course they are only *pseudo-random*, but still). — gung - Reinstate Monica, Nov 07 '13 at 14:29
It works until 10^9. But then I fail to reproduce your code, I run into an error for the `sample()` alternative: `Fehler in sample(as.numeric(10^12), 10, replace = TRUE) : ungültiges erstes Argument` (not valid first argument). What did you do to make R accept such high numbers? — albifrons, Nov 07 '13 at 14:32
Didn't this discussion take place w/ some other `R` PRNGs last month somewhere on SO? Someone dug deep into the code of some package and found it did "grab" two randoms per cycle. So it's really not surprising that two different functions could do different things with the pseudorandom sequence. — Carl Witthoft, Nov 07 '13 at 14:44
Looking for it... aha: http://stackoverflow.com/questions/18907600/erratic-seed-behavior-with-rbinomprob-0-5-in-r/18908655#18908655 — Carl Witthoft, Nov 07 '13 at 14:46
@BenBolker ah, so that explains this `/* Our PRNGs have at most 32 bit of precision, and all have at least 25 */` in the source code right before that definition. Thanks. — Simon O'Hanlon, Nov 07 '13 at 15:59
You should make this question answered. Maybe write your own answer and accept it. — Rémi, Dec 17 '13 at 20:47
@Rémi I don't consider the comments constitute an answer (I always accept when I do - check my history). There are pieces sure, but not something I feel is a whole answer. Feel free to contribute. — Simon O'Hanlon, Dec 17 '13 at 20:48

score 2 · Answer 1 · answered Dec 17 '13 at 20:53

The reason is that a 25-bit PRNG won't produce enough bits to generate all possible integer values in a range wider than 2^25. In order to give a non-zero probability to every possible integer value, it is necessary to call the 25-bit PRNG twice. With two calls (like in the code you cite), you get 50 random bits.

Note that a double has 53 bits of mantissa, so calling the PRNG twice is still short of 3 bits.

Why are two random deviates needed to ensure uniform sampling of large integers with sample()?

Update

1 Answers1