Why a self-written Rcpp vectorized mathematical function is faster than its base counterpart?

Question

OK, I know the answer, but being inspired by this question, I'd like to get some nice opinions about the following: Why the Rcpp exercise below is ca. 15% faster (for long vectors) than the built-in exp()? We all know that Rcpp is a wrapper to the R/C API, so we should expect a slightly worse performance.

Rcpp::cppFunction("
   NumericVector exp2(NumericVector x) {
      NumericVector z = Rcpp::clone(x);
      int n = z.size();
      for (int i=0; i<n; ++i)
         z[i] = exp(z[i]);
      return z;
   }
")

library("microbenchmark")
x <- rcauchy(1000000)
microbenchmark(exp(x), exp2(x), unit="relative")
## Unit: relative
##     expr      min       lq   median       uq      max neval
##   exp(x) 1.159893 1.154143 1.155856 1.154482 0.926272   100
##  exp2(x) 1.000000 1.000000 1.000000 1.000000 1.000000   100

Best guess: difference in compiler flags / optimization levels between R as compiled from CRAN and the `cppFunction` compiled locally? That and potentially more opportunities for optimization when compiling Rcpp code (through leveraging of type information encoded in Rcpp) — Kevin Ushey, Oct 17 '14 at 18:30
@KevinUshey: I have compiled R on my own. For Rcpp I use the same compiler flags. — gagolews, Oct 17 '14 at 18:32
"OK, I know the answer" - so share your answer too, not only the question. — Carlos Cinelli, Oct 17 '14 at 18:45
@gagolews For 1e6 randoms yes, for 1e8 it seems to be almost equal (though `exp` was still in front), but I don't have the patience for extended benchmarking (it takes pretty long). — Roland, Oct 17 '14 at 18:50

score 8 · Accepted Answer · answered Oct 17 '14 at 18:55

8

Base R tends to do more checking for NA so we can win a little by not doing that. Also note that by doing tricks like loop unrolling (as done in Rcpp Sugar) we can do little better still.

So I added

Rcpp::cppFunction("NumericVector expSugar(NumericVector x) { return exp(x); }")

and with that I get a further gain -- with less code on the user side:

R> microbenchmark(exp(x), exp2(x), expSugar(x), unit="relative")
Unit: relative
        expr     min      lq    mean  median      uq     max neval
      exp(x) 1.11190 1.11130 1.11718 1.10799 1.08938 1.02590   100
     exp2(x) 1.08184 1.08937 1.07289 1.07621 1.06382 1.00462   100
 expSugar(x) 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000   100
R>

answered Oct 17 '14 at 18:55

Dirk Eddelbuettel

360,940
56
644
725

I'm not sure I understand your point regarding NA checking. `exp2` handles `NA`, `Inf`, `NaN` input just fine. – Roland Oct 17 '14 at 18:57
@Roland Rcpp has (slightly) faster NA checking than base R does. – Kevin Ushey Oct 17 '14 at 19:00
`exp` is still faster on my mac if I do the same benchmarks. – Roland Oct 17 '14 at 19:04
3

@Roland: Why not read the source at https://github.com/wch/r-source/blob/trunk/src/main/arithmetic.c#L1156 instead of arguing with me? Base R functions do more arg checking, validity checking, this, that and the other which the five line function by the OP does not do. – Dirk Eddelbuettel Oct 17 '14 at 19:05
@Roland: exp() itself (it's not an Rcpp fun) does no NA checking here; it just bases on the fact that `NA` is represented as a special kind of `NaN`. `NaN` and `Inf` handling is done by the FPU itself. – gagolews Oct 17 '14 at 19:22

score 6 · Answer 2 · answered Oct 21 '14 at 19:36

If you really want to get performance improvements, code has to be written to leverage underlying hardware concurrency. You can do this using the RcppParallel package and its parallelFor would be an ideal vessel for this.

You can also try a more modern implementation of R/C++. The next version of Rcpp11, released in a few days will feature automatically threaded sugar, making the expSugar from the previous answer better.

Consider:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
NumericVector exp2(NumericVector x) {
   NumericVector z = Rcpp::clone(x);
   int n = z.size();
   for (int i=0; i<n; ++i)
      z[i] = exp(z[i]);
   return z;
}

// [[Rcpp::export]]
NumericVector expSugar(NumericVector x) {
    return exp(x) ;
}

/*** R
    library(microbenchmark)
    x <- rcauchy(1000000)
    microbenchmark(exp(x), exp2(x), expSugar(x))
*/

With Rcpp I get:

$ RcppScript /tmp/exp.cpp

> library(microbenchmark)

> x <- rcauchy(1e+06)

> microbenchmark(exp(x), exp2(x), expSugar(x))
Unit: milliseconds
        expr      min       lq   median       uq      max neval
      exp(x) 7.027006 7.222141 7.421041 8.631589 21.78305   100
     exp2(x) 6.631870 6.790418 7.064199 8.145561 31.68552   100
 expSugar(x) 6.491868 6.761909 6.888111 8.154433 27.36302   100

So nice, but somewhat anecdotic improvement which can be explained by various inlining, etc ... as described in other answers and comments.

With Rcpp11 and automatic threaded sugar, I get:

$ Rcpp11Script /tmp/exp.cpp

> library(microbenchmark)

> x <- rcauchy(1e+06)

> microbenchmark(exp(x), exp2(x), expSugar(x))
Unit: milliseconds
        expr      min       lq   median       uq      max neval
      exp(x) 7.029882 7.077804 7.336214 7.656472 15.38953   100
     exp2(x) 6.636234 6.748058 6.917803 7.017314 12.09187   100
 expSugar(x) 1.652322 1.780998 1.962946 2.261093 12.91682   100

I'm really looking forward to the Rcpp11 CRAN premiere! Good luck, guys! — gagolews, Oct 21 '14 at 19:38

Why a self-written Rcpp vectorized mathematical function is faster than its base counterpart?

2 Answers2