Speeding up Rcpp `anyNA` equivalent

Question

This question is related to this old question and this old question.

R has the nice wrapper-ish function anyNA for quicker evaluation of any(is.na(x)). When working in Rcpp a similar minimal implementation could be given by:

// CharacterVector example
#include <Rcpp.h>
using namespace Rcpp;
template<typename T, typename S>
bool any_na(S x){
  T xx = as<T>(x);
  for(auto i : xx){
    if(T::is_na(i))
      return true;
  }
  return false;
}

// [[Rcpp::export(rng = false)]]
LogicalVector any_na(SEXP x){
  return any_na<CharacterVector>(x);
}

// [[Rcpp::export(rng = false)]]
SEXP overhead(SEXP x){
  CharacterVector xx = as<CharacterVector>(x);
  return wrap(xx);
}
/***R

library(microbenchmark)
vec <- sample(letters, 1e6, TRUE)
vec[1e6] <- NA_character_
any_na(vec)
# [1] TRUE
*/

But comparing the performance of this to anyNA I was surprised by the benchmark below

library(microbenchmark)
microbenchmark(
  Rcpp = any_na(vec), 
  R = anyNA(vec),
  overhead = overhead(vec), 
  unit = "ms"
)
Unit: milliseconds
     expr      min        lq     mean    median       uq      max neval cld
     Rcpp 2.647901 2.8059500 3.243573 3.0435010 3.675051 5.899100   100   c
        R 0.800300 0.8151005 0.952301 0.8577015 0.961201 3.467402   100  b 
 overhead 0.001300 0.0029010 0.011388 0.0122510 0.015751 0.048401   100 a

where the last line is the "overhead" incurred from converting back and forth from SEXP to CharacterVector (turns out to be negligible). As immediately evident the Rcpp version is roughly ~3.5 times slower than the R version. I was curious so I checked up on the source for Rcpp's is_na and finding no obvious reasons for the slow performance I continued to check the source for anyNA for R's own character vectors's and reimplementing the function using R's C API thinking to speed up this

// Added after SEXP overhead(SEXP x){ --- }
inline bool anyNA2(SEXP x){
  R_xlen_t n = Rf_length(x);
  for(R_xlen_t i = 0; i < n; i++){
    if(STRING_ELT(x, i) == NA_STRING)
      return true;
  }
  return false;
}
// [[Rcpp::export(rng = false)]]
SEXP any_na2(SEXP x){
  bool xx = anyNA2(x);
  return wrap(xx);
}
// [[Rcpp::export(rng = false)]]
SEXP any_na3(SEXP x){
  Function anyNA("anyNA");
  return anyNA(x);
}
/***R
microbenchmark(
  Rcpp = any_na(vec), 
  R = anyNA(vec),
  R_C_api = any_na2(vec),
  Rcpp_Function = any_na3(vec),
  overhead = overhead(vec),
  unit = "ms"
)
# Unit: milliseconds
# expr      min        lq       mean    median       uq      max neval  cld
# Rcpp 2.654901 2.8650515 3.54936501 3.2392510 3.997901 8.074201   100    d
# R 0.803701 0.8303015 1.01017200 0.9400015 1.061751 2.019902   100  b
# R_C_api 2.336402 2.4536510 3.01576302 2.7220010 3.314951 6.905101   100   c
# Rcpp_Function 0.844001 0.8862510 1.09259990 0.9597505 1.120701 3.011801   100  b
# overhead 0.001500 0.0071005 0.01459391 0.0146510 0.017651 0.101401   100 a
*/

Note that I've included a simple wrapper calling anyNA through Rcpp::Function as well. Once again this implementation of anyNA is not just a little but alot slower than the base implementation.

So the question becomes 2 fold:

Why is the Rcpp so much slower?
Derived from 1: How could this be "changed" to speed up the code?

The questions themselves are not very interesting in itself, but it is interesting if this is affecting multiple parts of Rcpp implementations that may in aggregate gain significant performance boosts.

SessonInfo()

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_Denmark.1252  LC_CTYPE=English_Denmark.1252    LC_MONETARY=English_Denmark.1252 LC_NUMERIC=C                     LC_TIME=English_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7    cmdline.arguments_0.0.1 glue_1.4.2              R6_2.5.0                Rcpp_1.0.6             

loaded via a namespace (and not attached):
 [1] codetools_0.2-18 lattice_0.20-41  mvtnorm_1.1-1    zoo_1.8-8        MASS_7.3-53      grid_4.0.3       multcomp_1.4-15  Matrix_1.2-18    sandwich_3.0-0   splines_4.0.3   
[11] TH.data_1.0-10   tools_4.0.3      survival_3.2-7   compiler_4.0.3

Edit (Not only a windows problem):

I wanted to make sure this is not a "Windows problem" so I went through and executed the problem within a Docker container running linux. The result is shown below and is very similar

# Unit: milliseconds
#           expr    min      lq     mean  median      uq     max neval
#           Rcpp 2.3399 2.62155 4.093380 3.12495 3.92155 26.2088   100
#              R 0.7635 0.84415 1.459659 1.10350 1.42145 12.1148   100
#        R_C_api 2.3358 2.56500 3.833955 3.11075 3.65925 14.2267   100
#  Rcpp_Function 0.8163 0.96595 1.574403 1.27335 1.56730 11.9240   100
#       overhead 0.0009 0.00530 0.013330 0.01195 0.01660  0.0824   100

Session info:

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-openmp/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] microbenchmark_1.4-7 Rcpp_1.0.5

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2

(Complete: Why are you using `inline`? It was a very useful package when we had nothing else (and still works fine) but Rcpp Attributes make things both more compact and readable.) — Dirk Eddelbuettel, Mar 08 '21 at 15:29
I am not entirely sure where I'm using `inline` @DirkEddelbuettel? I'm using `Rcpp::sourceCpp` to make the code reproducible here, while In practice the code is wrapped in in a package (made no difference in evaluation time) — Oliver, Mar 08 '21 at 15:31
Ah yes, my bad, Still odd as you could just cppFunction and or post a cpp file with embedded R code. To my eyes your post is basically unreadable as you insist on presenting all the C++ code (which is what matters) as a constant R string. Oh well. — Dirk Eddelbuettel, Mar 08 '21 at 15:32
Makes sense. I honestly didn't consider that. For the readers sake I'll edit. — Oliver, Mar 08 '21 at 15:33
Great! My (reasonably strong) preference is one (or more) C++ file(s), function names meaningful, and then `/*** R .... */` snippets to run the R code and benchmarks _with the same function names_. — Dirk Eddelbuettel, Mar 08 '21 at 15:35
Also, shooting completely from the hip, your first statement in your first function *forces* a conversion which the (already vectorized too and compiled code underneath) R function won't have. It will be hard to beat someone in a race if you tie your legs together... — Dirk Eddelbuettel, Mar 08 '21 at 15:36
I believe the reimplementation using R's C-api should account for this, as there wouldn't be any conversion at this point? Also I added "overhead" to check the conversion factor to `Charactervector` and it seems negligible. And this example is still "much" slower (Eg. I may be a bit misleading by saying this is an "rcpp" problem) Question still stands "why" and "how" to make up for it however. :-) — Oliver, Mar 08 '21 at 15:42
Using `times=1000, unit="relative"` and omitting the empty overhead function, we two pairs of essentially equivalent functions. What you re-demonstrated is that element access is not "free", and subsetting has a cost at the C and C++ level too. I now invite you to study the base R implementation in C if you think that you are in fact spending relevanrt cycles on this problem :) — Dirk Eddelbuettel, Mar 08 '21 at 17:27
Now I may be completely misunderstanding your point @DirkEddelbuettel, but I already did follow your suggestion and researched the [R source code](https://github.com/wch/r-source/blob/676105cde041b5db7416c0a0df112512c9350b7a/src/main/coerce.c#L2357) (`any_na2` follows the base implementation). Now in my book this should be comparing apples and apples, and the result of `any_na2` is only slightly faster than `any_na` and years slower than `anyNA` in base R. Changing to relative time does not change this. Please prove me wrong by beating `any_na2` with some "black magic" :-) — Oliver, Mar 08 '21 at 17:47
I don't have enough spare time to write code on demand for you, sorry. You may also have misunderstood (or maybe I did, again, it happens). But I don't think you posted anything resembling the actual underlying implementation in R. — Dirk Eddelbuettel, Mar 08 '21 at 17:57
Two relevant links are the default method https://github.com/r-devel/r-svn/blob/master/src/main/coerce.c#L2297-L2406 and the entry point https://github.com/r-devel/r-svn/blob/master/src/main/coerce.c#L2408-L2436 Rcpp gives you (a lot of) convenience to not have to deal with code at this lower level of abstraction. That may involve a cost of a few nanoseconds here or there. You can pick if you can afford that if you need to spend hours writing and debugging lower-level code. Having the ability to choose is a good thing. — Dirk Eddelbuettel, Mar 08 '21 at 18:04
Please @DirkEddelbuettel follow the link in my answer to your comment and in the question itself. (They go directly to `doANY` in R source). Understandable. We all have limited time. `do_any2` resembles line 2357 - 2360 for `STRSXP` in `anyNA` in the R C-source, called by `do_anyNA` (`do_anyNA` is called by `anyNA` from R). The remaining part of the code handles error checking and other input types (such as `VECSXP`). So I have simply eliminated "unnecessary" parts of `do_anyNA` and `anyNA` in R-source and implemented them in `any_na2`. I can't find the pears in the example. — Oliver, Mar 08 '21 at 18:06
A bit of confusion it directs to `anyNA` in R's C-source code. There is no leftover `Rcpp` in `any_na2` outside the boilerplate left by Rcpp-Attributes (accounting for less than 1 % of the time difference). Thank you for your time, hopefuly another soul has an idea why the time difference is there. Good evening/day on your side of the world. :-) — Oliver, Mar 08 '21 at 18:21

thc · Accepted Answer · 2021-03-09T06:53:31.063

8

This is an interesting question, but the answer is pretty simple: there are two versions of STRING_ELT one used internally by R or if you set the USE_RINTERNALS macro in Rinlinedfuns.h and one for plebs in memory.c.

Comparing the two versions, you can see that the pleb version has more checks, which fully accounts for the difference in speed.

If you really want speed and don't care about safety, you can usually beat R by at least a little bit.

// [[Rcpp::export(rng = false)]]
bool any_na_unsafe(SEXP x) {
  SEXP* ptr = STRING_PTR(x);
  R_xlen_t n = Rf_xlength(x);
  for(R_xlen_t i=0; i<n; ++i) {
    if(ptr[i] == NA_STRING) return true;
  }
  return false;
}

Bench:

> microbenchmark(
+   R = anyNA(vec),
+   R_C_api = any_na2(vec),
+   unsafe = any_na_unsafe(vec),
+   unit = "ms"
+ )
Unit: milliseconds
    expr    min      lq     mean  median      uq     max neval
       R 0.5058 0.52830 0.553696 0.54000 0.55465  0.7758   100
 R_C_api 1.9990 2.05170 2.214136 2.06695 2.10220 12.2183   100
  unsafe 0.3170 0.33135 0.369585 0.35270 0.37730  1.2856   100

Although as written this is unsafe, if you add a few checks before the loop in the beginning it'd be fine.

edited Mar 09 '21 at 06:53

answered Mar 09 '21 at 06:45

thc

9,527
1
24
39

Wow great answer. I had tried to find if there were some pointer minutia going around. I thought I had checked for it all, but I did not expect that I overlooked a second definition of `*_ELT`. I am still confused why it is so different (`anyNA` calls `STRING_ELT` at each iteration) but this does explain the time difference. Thanks for going the extra mile. I was planning to add a bounty. It'll end up going to your answer it seems. :-) – Oliver Mar 09 '21 at 09:05
Ah, yeah brainfart. `CHK` or `CHKZLN` is called upon each value checking if they are `UNPROTECT`ed. Makes sense. – Oliver Mar 09 '21 at 09:25
Nicely done. BTW I recommend removing `unit="ms"` as `microbenchmark` knows best what to display, which can be `nanoseconds`. And `unit="relative"` can be illuminating too. – Dirk Eddelbuettel Mar 09 '21 at 13:40
And I was thinking of the (old) benchmark examples we have in the Rcpp package: basically nothing can bear just assigning a pointer and incrementing it saving the (function call overheads) with subsetting operators. But I forgot about these macros doing just that, so glad you put this out. – Dirk Eddelbuettel Mar 09 '21 at 13:45

Dirk Eddelbuettel · Answer 2 · 2021-03-09T13:42:08.853

This questions turns out to be a good example of why some people rail and rant against microbenchmarks.

Baseline is a built-in primitive

The function that is supposed to be beat here is actually a primitive so that makes it a little tricky already

> anyNA
function (x, recursive = FALSE)  .Primitive("anyNA")
>

ALTREP puts a performance floor down

Next, a little experiment shows that the baseline function anyNA() never loops. We define a very short vector srt and a long vector lng, both contain a NA value. Turns out ... R is optimised via ALTREP keeping a matching bit in the data structure headers and the cost of checking is independent of length:

> srt <- c("A",NA_character_); lng <- c(rep("A", 1e6), NA_character_)
> microbenchmark(short=function(srt) { anyNA(srt) }, 
+                long=function(lng) { anyNA(lng) }, times=1000)
Unit: nanoseconds
  expr min lq   mean median uq   max neval cld
 short  48 50 69.324     51 53  5293  1000   a
  long  48 50 92.166     51 52 15494  1000   a
>

Note the units here (nanoseconds) and time spent. We are measuring looking at single bit.

(Edit: Scrab that. Thinko of mine in a rush, see comments.)

Rcpp functions have some small overhead

This is not new and documented. If you look at the code generated by Rcpp Attributes, conveniently giving us an R function of the same name of the C++ function we designate you see that at least one other function call is involved. Plus a baked-in try/catch layer, RNG setting (here turned off) and so on. That cannot be zero, and if amortized against anything reasonable it does neither matter not show up in measurements.

Here, however, the exercise was set up to match a primitive function looking at one bit. It's a race one cannot win. So here is my final table

> microbenchmark(anyNA = anyNA(vec), Rcpp_plain = rcpp_c_api(vec), 
+     Rcpp_tmpl = rcpp_any_na(vec), Rcpp_altrep = rcpp_altrep(vec), 
+     times = .... [TRUNCATED] 
Unit: microseconds
        expr      min      lq     mean   median      uq      max neval  cld
       anyNA  643.993  658.43  827.773  700.729  819.78  6280.85  5000 a   
  Rcpp_plain 1916.188 1952.55 2168.708 2022.017 2191.64  8506.71  5000    d
   Rcpp_tmpl 1709.380 1743.04 1933.043 1798.788 1947.83  8176.10  5000   c 
 Rcpp_altrep 1501.148 1533.88 1741.465 1590.572 1744.74 10584.93  5000  b

It contains the primitive R function, the original (templated) C++ function which looks pretty good still, something using Rcpp (and its small overhead) with just C API use (plus the automatic wrappers in/out) a little slower -- and then for comparison a function from Michel's checkmate package which does look at the ALTREP bit. And it is barely faster.

So really what we are looking at here is overhead from function calls getting in the way of measurning a micro-operations. So no, Rcpp cannot be made faster than a highly optimised primitive. The question looked interesting, but was, at the end of the day, somewhat ill-posed. Sometimes it is worth working through that.

My code version follows below.

// CharacterVector example
#include <Rcpp.h>
using namespace Rcpp;
template<typename T, typename S>
bool any_na(S x){
    T xx = as<T>(x);
    for (auto i : xx){
        if (T::is_na(i))
            return true;
    }
    return false;
}

// [[Rcpp::export(rng = false)]]
LogicalVector rcpp_any_na(SEXP x){
    return any_na<CharacterVector>(x);
}

// [[Rcpp::export(rng = false)]]
SEXP overhead(SEXP x){
    CharacterVector xx = as<CharacterVector>(x);
    return wrap(xx);
}

// [[Rcpp::export(rng = false)]]
bool rcpp_c_api(SEXP x) {
    R_xlen_t n = Rf_length(x);
    for (R_xlen_t i = 0; i < n; i++) {
        if(STRING_ELT(x, i) == NA_STRING)
            return true;
  }
  return false;
}

// [[Rcpp::export(rng = false)]]
SEXP any_na3(SEXP x){
  Function anyNA("anyNA");
  return anyNA(x);
}

// courtesy of the checkmate package
// [[Rcpp::export(rng=false)]]
R_xlen_t rcpp_altrep(SEXP x) {
#if defined(R_VERSION) && R_VERSION >= R_Version(3, 5, 0)
    if (STRING_NO_NA(x))
        return 0;
#endif
    const R_xlen_t nx = Rf_xlength(x);
    for (R_xlen_t i = 0; i < nx; i++) {
        if (STRING_ELT(x, i) == NA_STRING)
            return i + 1;
    }
    return 0;
}


/***R
library(microbenchmark)

srt <- c("A",NA_character_)
lng <- c(rep("A", 1e6), NA_character_)
microbenchmark(short = function(srt) { anyNA(srt) },
               long = function(lng) { anyNA(lng) },
               times=1000)

N <- 1e6
vec <- sample(letters, N, TRUE)
vec[N] <- NA_character_
anyNA(vec)                      # to check


microbenchmark(
  anyNA       = anyNA(vec),
  Rcpp_plain  = rcpp_c_api(vec),
  Rcpp_tmpl   = rcpp_any_na(vec),
  Rcpp_altrep = rcpp_altrep(vec),
  #Rcpp_Function = any_na3(vec),
  #overhead = overhead(vec),
  times = 5000
#  unit="relative"
)
*/

With this question I learned microbenchmarks (for C and RCPP functions), what are primitives (are they written in assembly to be so fast?), that ALTREP means alternative representations (the phrase made me investigate) and that C++ is not so intimidating... (sure it is not easy). Thank you for that great answer Dirk! — Manu, Mar 08 '21 at 21:28
Hey Dirk, thank you for your answer (it is a great read!), I am not replicating your results on Windows R-4.0.3 so I will go through investigating a bit more (More specifically: `anyNA` is not static in performance over length). As a side note, I never looked to beat `anyNA` but just to uncover why the performance was *so* different. Also the overhead incurred by Rcpp attributes is so small in this context that it is insignificant. — Oliver, Mar 08 '21 at 21:52
Actually thinking twice, that is easy to find. `microbenchmark` doesn't evaluate the `function` as a call but simply the `function` as an expression. So your first benchmark only creates 2 functions 1e4 times, making it seem like there is "no iteration". (Check `microbenchmark(anyNA(srt), anyNA(lng))`) — Oliver, Mar 08 '21 at 21:56
Whoops. You are right there. I had an earlier variant with both `srt` and `lng` as arguments and couldn't explain to myself why they would differ. (`SEXP` don't contain the payload.) Anyway, I think am done here now. I still think you are looking at this the wrong way and am imagining a big question where there is none. An Rcpp function will never beat an R primitive particularly on an (essentially) unmeasurably small operation. So the upshot is: the performance is not "so different" as you claim. It's nanoseconds. — Dirk Eddelbuettel, Mar 08 '21 at 22:12
Once again thank you so much for your time Dirk. It is greatly appreciated, although I do not completely agree with your conclusion. "How is a function that is 1500 lines longer, but which ultimately calls the same 3 lines of code, more efficient than the function with only 3 lines of code?" (overstatement of lines may be present). I am guessing it is my lack of knowledge on optimizing C++ code to the core (and possibly I need to dig deeper in R-source too). Albeit the question still stands for if I'm lucky and someone passes by with the answer. ;-) — Oliver, Mar 08 '21 at 23:52
Sorry, but you are still not seeing the full picture. Put Rcpp aside for an hour or two. Write a pure C function. Use `R CMD COMPILE ...` and `R CMD SHLIB ...`, then `dyn.load()` and `do.call()` to be _perfectly_ free of Rcpp helper code. Then, and only then, compare line numbers. Oh, and marvel that you still will not beat a `.Primitive`. — Dirk Eddelbuettel, Mar 08 '21 at 23:59
Done already. In my specific case I cannot (or well I've set myself up to be unable to) use Rcpp attributes, so the package exports are made by hand. I removed all boiler plate and I know that it does not beat it. The question was "why" does it not get even close (as I'm not trying to beat, I am trying to understand) — Oliver, Mar 09 '21 at 00:04
Basically: I packaged up the code. Compared the results with and without Attributes. I compared it using `.External` vs `.Call`. I compared using `O2` and `O3` optimization flags. I compred the overhead of `.Primitive` (to the extend I could) using `sum` as a benchmark (here it is reduced to almost 0). Maybe the big picture is lost to me. :-) — Oliver, Mar 09 '21 at 00:12