4

Is it possible to get the same sample of integers from Rcpp as from base R's sample?

I have tried using Rcpp::sample and Rcpp::RcppArmadillo::sample but they do not return the same values -- example code below. Additionally, the Quick Example section of post https://gallery.rcpp.org/articles/using-the-Rcpp-based-sample-implementation/ returns the same sample from Rcpp and base R, however, I cannot reproduce these results (I attach this code at the end).

Can this be done / what am I doing wrong please?

My attempts:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <RcppArmadilloExtensions/sample.h>

// [[Rcpp::export]]
Rcpp::IntegerVector mysamp1( int n) {
  Rcpp::IntegerVector v = Rcpp::sample(n, n);
  return v;
}

// [[Rcpp::export]]
Rcpp::IntegerVector mysamp2(int n) {  
  Rcpp::IntegerVector i = Rcpp::seq(1,n);
  Rcpp::IntegerVector v = wrap(Rcpp::RcppArmadillo::sample(i,n,false));
  return v;
}

// set seed https://stackoverflow.com/questions/43221681/changing-rs-seed-from-rcpp-to-guarantee-reproducibility
// [[Rcpp::export]]
void set_seed(double seed) {
  Rcpp::Environment base_env("package:base");
  Rcpp::Function set_seed_r = base_env["set.seed"];
  set_seed_r(std::floor(std::fabs(seed)));
}

// [[Rcpp::export]]
Rcpp::IntegerVector mysamp3( int n, int seed) {
  set_seed(seed); 
  Rcpp::IntegerVector v = Rcpp::sample(n, n);
  return v;
}


/***R
set.seed(1)
sample(10)
#  [1]  9  4  7  1  2  5  3 10  6  8
set.seed(1)
mysamp1(10)
#  [1]  3  4  5  7  2  8  9  6 10  1
set.seed(1)
mysamp2(10)
#  [1]  3  4  5  7  2  8  9  6 10  1
mysamp3(10, 1)
#  [1]  3  4  5  7  2  8  9  6 10  1

*/

Code from the Using the RcppArmadillo-based Implementation of R's sample() gallery post which return FALSE on my system:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadilloExtensions/sample.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector csample_char( CharacterVector x, 
                              int size,
                              bool replace, 
                              NumericVector prob = NumericVector::create()) {
  CharacterVector ret = RcppArmadillo::sample(x, size, replace, prob) ;
  return ret ;
}

/*** R
N <- 10
set.seed(7)
sample.r <- sample(letters, N, replace=T)

set.seed(7)
sample.c <- csample_char(letters, N, replace=T)

print(identical(sample.r, sample.c))
# [1] FALSE
*/
user2957945
  • 2,353
  • 2
  • 21
  • 40
  • 1
    Perhaps setting the `RNGKind` – akrun Feb 07 '20 at 19:08
  • 1
    Does this guide help: https://gallery.rcpp.org/articles/random-number-generation/ – MrFlick Feb 07 '20 at 19:12
  • 1
    Thanks akrun that did it. setting either `RNGversion("1.6.2")` or `RNGkind("Super")` for example produced the same results. – user2957945 Feb 07 '20 at 19:12
  • Thanks for the link @MrFlick ; – user2957945 Feb 07 '20 at 19:13
  • 1
    Nice quick answer :) I think we have an open issue ticket to update the sample implementations to what R does now. If someone has time to contribute... – Dirk Eddelbuettel Feb 07 '20 at 19:19
  • @DirkEddelbuettel; so is this change (having to explicitly set an rng) due to some change in R's sample algorithm? (I did check other samplers and runif and rnorm still produce the same results between Rcpp and R). Thanks – user2957945 Feb 07 '20 at 19:30
  • 1
    I forget the details but it a well-documented change in R's RNG that came about because someone noticed a bias in, IIRC, use of sampling (at very large N). So that why you you to turn an option on in R to get the old (matching) behaviour. Now someone with a moment of time time should update the Rcpp side to the _new_ behaviour in R so that you don't need to turn R back. – Dirk Eddelbuettel Feb 07 '20 at 19:34
  • 1
    It's not too urgent because as you say, basic RNG ops are of course consistent and Rcpp just passes through to R. But sample() is one step up in the foodchain. – Dirk Eddelbuettel Feb 07 '20 at 19:35
  • Thanks @DirkEddelbuettel; I remember reading about the change in the news file. Appreciate the details. Thanks & thanks all. – user2957945 Feb 07 '20 at 19:37
  • 1
    Here the comment I added over at https://stackoverflow.com/a/54951936/8416610 after you deleted your comment but before I reloaded the page ...: It is a known issue that these `sample` methods have not been updated to the (better) methods found in R 3.6 (c.f. https://github.com/RcppCore/Rcpp/issues/945 and https://github.com/RcppCore/RcppArmadillo/issues/250). You can switch R back to use the old (biased) method using `RNGkind(sample.kind = "Rounding")`. This should give you the same results as long as you are sampling from less than `1e7` elements. – Ralf Stubner Feb 07 '20 at 23:16
  • @RalfStubner; apologies I deleted my comment as I thought you hadn't seen it and I also thought it may be worth asking. But thanks for your time and the info -- added details to the answer. – user2957945 Feb 08 '20 at 03:57

1 Answers1

6

Compiling comments into an answer. Akrun noted that by setting RNGkind or RNGversion we can replicate results. From DirkEddelbuettel; there was a "change in R's RNG that came about because someone noticed a bias in, IIRC, use of sampling (at very large N). So thats why you you to turn an option on in R to get the old (matching) behaviour. " And RalfStubner indicates that this is a known issue: https://github.com/RcppCore/RcppArmadillo/issues/250 and https://github.com/RcppCore/Rcpp/issues/945

Presently R uses a different default sampler which leads to different results

RNGkind(sample.kind = "Rejection")
set.seed(1)
sample(10)
# [1]  9  4  7  1  2  5  3 10  6  8
set.seed(1)
mysamp1(10)
# [1]  3  4  5  7  2  8  9  6 10  1

However, an earlier version can be used using

RNGkind(sample.kind = "Rounding")
#Warning message:
#  In RNGkind("Mersenne-Twister", "Inversion", "Rounding") : non-uniform 'Rounding' sampler used

set.seed(1)
sample(10)
# [1]  3  4  5  7  2  8  9  6 10  1
set.seed(1)
mysamp1(10)
# [1]  3  4  5  7  2  8  9  6 10  1
user2957945
  • 2,353
  • 2
  • 21
  • 40