3

This is a follow-on question from this one: Generating same random variable in Rcpp and R

I'm trying to speed up a vectorised call to rbinom of this form:

    x <- c(0.1,0.4,0.6,0.7,0.8)
    rbinom(length(x),1 ,x)

In the live code of x is a vector of variable length (but typically numbering in the millions). I have no experience with Rcpp but I was wondering could I use Rcpp to speed this up. From the linked question this Rcpp code was suggested for non-vectorised rbinom calls by @Dirk Eddelbuettel :

    cppFunction("NumericVector cpprbinom(int n, double size, double prob) { \
         return(rbinom(n, size, prob)); }")
    set.seed(42); cpprbinom(10, 1, 0.5)

....and is about twice as fast as the non Rcpp option, but can't handle my vectorised version

    cpprbinom(length(x), 1, x)

How can the Rcpp code be modified to implement this?

Thanks

Community
  • 1
  • 1
user2498193
  • 1,072
  • 2
  • 13
  • 32

2 Answers2

7

Following Dirk's response here:

Is there a way of fixing the code without using an explicit loop in the C++ code?

I don't think so. The code currently has this hard-wired: <...> so until one of us has sufficient [time] to extend this (and test it) will have to do the loop at your end.

Here's my implementation of a "vectorised" code:

library(Rcpp)
cppFunction("NumericVector cpprbinom(int n, double size, NumericVector prob) { 
    NumericVector v(n);            
    for (int i=0; i<n; i++) {v[i] = as<double>(rbinom(1, size, prob[i]));} 
    return(v); }")
r <- runif(1e6)
all.equal({set.seed(42); rbinom(length(r), 1, r)}, 
          {set.seed(42); cpprbinom(length(r), 1, r)})
#TRUE

But the problem is (again citing Dirk),

And I suggest that before expending a lot of effort on this you check whether you are likely to do better than the R function rbinom. That R function is vectorized in C code and you are unlikely to make things much faster by using Rcpp, unless you want to use the random variates in another C++ function.

And it is actually slower (x3 on my machine), so at least such naive implementation as mine won't help:

library(microbenchmark)
microbenchmark(rbinom(length(r), 1, r), cpprbinom(length(r), 1, r))

Unit: milliseconds
                       expr       min        lq      mean    median        uq       max neval
    rbinom(length(r), 1, r)  55.50856  56.09292  56.49456  56.45297  56.65897  59.42524   100
 cpprbinom(length(r), 1, r) 117.63761 153.37599 154.94164 154.29623 155.37247 225.56535   100

EDIT: according to Romain's comment below, here's an advanced version, which is faster!

cppFunction(plugins=c("cpp11"), "NumericVector cpprbinom2(int n, double size, NumericVector prob) { 
    NumericVector v = no_init(n);
    std::transform( prob.begin(), prob.end(), v.begin(), [=](double p){ return R::rbinom(size, p); }); 
    return(v);}")
r <- runif(1e6)
all.equal({set.seed(42); rbinom(length(r), 1, r)}, 
          {set.seed(42); cpprbinom(length(r), 1, r)}, 
          {set.seed(42); cpprbinom2(length(r), 1, r)})
#TRUE
microbenchmark(rbinom(length(r), 1, r), cpprbinom(length(r), 1, r), cpprbinom2(length(r), 1, r))

Unit: milliseconds
                        expr       min        lq      mean    median        uq       max neval
     rbinom(length(r), 1, r)  55.26412  56.00314  56.57814  56.28616  56.59561  60.01861   100
  cpprbinom(length(r), 1, r) 113.72513 115.94758 122.81545 117.24708 119.95134 168.47246   100
 cpprbinom2(length(r), 1, r)  36.67589  37.12182  38.95318  37.37436  37.97719  84.73516   100
tonytonov
  • 25,060
  • 16
  • 82
  • 98
  • 6
    When you do `NumericVector v(n);` you pay the price for initializing all the values to `0`. Use `NumericVector v = no_init(n);` instead. Using `Rcpp::rbinom` creates an R object each time, that's not free and that's useless, use the `R::rbinom` instead for scalars. Perhaps something like: `std::transform( prob.begin(), prob.end(), v.begin(), [=]( double p){ return R::rbinom(size, prob) ;}) ;` – Romain Francois Apr 03 '15 at 13:18
  • 1
    @RomainFrancois Thank you for pointing that out, very helpful. You can check the updated benchmark above. – tonytonov Apr 03 '15 at 13:52
  • 1
    Thank you both for this - its my first time trying to do anything via Rcpp and so I can barely follow the code but this does give me equivalent results and a significant speed boost in my test data set which went from about 120secs down to about 90secs! Happy with that! – user2498193 Apr 03 '15 at 14:09
5

Not a general solution, but I'm noticing that you set the size argument to 1 in your call to rbinom. If that's always the case, you can draw length(x) uniform values and then comparing to x. For instance:

 set.seed(123)
 #create the values
 x<-runif(1000000)
 system.time(res<-rbinom(length(x),1 ,x))   
 # user  system elapsed 
 #0.068   0.000   0.070
 system.time(res2<-as.integer(runif(length(x))<x))   
 # user  system elapsed 
 #0.044   0.000   0.046

Not a huge gain, but maybe you can save some little time if you call runif from C++, avoiding some overhead.

nicola
  • 24,005
  • 3
  • 35
  • 56
  • Hi nicola that is indeed faster but is mathematically non-equivalent. Try this to see `x<-runif(1000000) set.seed(123); res1<-rbinom(length(x),1 ,x)) set.seed(123); res2<-as.integer(runif(length(x)) – user2498193 Apr 03 '15 at 13:40
  • 2
    What? Are you sure about what you are saying? Also, I don't get what the `hist` should prove. They _are_ equivalent: how would you extract a number between 0 and 1 if 1 has a p probability and 0 a 1-p prob? – nicola Apr 03 '15 at 13:45
  • The point is `hist(res1)` shows two columsn with a 30:70 split, whilst `hist(res2)` shows two cloumns with a 50:50 split and much more values. Try `table(res1) ; table(res2)`. Unless I've done something wrong ? – user2498193 Apr 03 '15 at 13:50
  • 1
    You must be doing something wrong; I'm getting 50/50 in both cases. Also, you have a couple of extra `)` in your comment. – nicola Apr 03 '15 at 13:57
  • Oh my apologies I had stuff in memory also called res1 & res2 and so didn't notice when those extra `)`s didn't work. However I still think that the two methods are not mathematically the same - this approach changes my results. Try this toy test (and yes I cleared my memory this time :) `a <- rep(0.7, 100000) ; res1 <- rbinom(length(a),1,a) ; res2 <- as.integer(runif(length(a) – user2498193 Apr 03 '15 at 14:06
  • Oh no! I had a ) in wrong place. Apologies you were correct all along!. – user2498193 Apr 03 '15 at 14:12
  • So this does speed up the code - but not quite as much as the answer from tony and Romain. Thank you in any case for the solution – user2498193 Apr 03 '15 at 15:48