Fast checking of missing values in Rcpp

Question

This question is linked to NA values in Rcpp conditional.

I basically have some Rcpp code that loop over multiple (double) elements. And I need to check if there are missing values, for each element (and I can't use vectorization). Let's count the number of missing values in a vector, just as minimal reproducible example:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
  return c;
}

// [[Rcpp::export]]
int nb_na3(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (x[i] == 3) c++;
  return c;
}

// [[Rcpp::export]]
LogicalVector na_real(NumericVector x) {
  return x == NA_REAL;
}

Then, in R, we get:

> x <- rep(c(1, 2, NA), 1e4)

> x2 <- replace(x, is.na(x), 3)

> microbenchmark::microbenchmark(
+   nb_na(x),
+   nb_na3(x2)
+ )
Unit: microseconds
       expr     min      lq      mean  median       uq      max neval
   nb_na(x) 135.633 135.982 153.08586 139.753 140.3115 1294.928   100
 nb_na3(x2)  22.490  22.908  30.14005  23.188  23.5025  684.026   100

> all.equal(nb_na(x), nb_na3(x2))
[1] TRUE

> na_real(x[1:3])
[1] NA NA NA

As noted in the linked question, you can't just check x[i] == NA_REAL because it always returns a missing value. Yet, using R_IsNA(x[i]) is much slower that checking equality with a numeric value (e.g. 3).

Basically, I want a solution where I can check that a single value is a missing value. This solution should be as fast as checking equality with a numeric value.

Romain Francois · Accepted Answer · 2017-10-31T09:08:12.167

Checking for missing value or any NaN specific variant is always going to be more expensive than checking for a specific value. That's just floating point arithmetic.

However there's still room for improvement in your code. I would encourage you to use NumericVector::is_na instead of R_IsNA but this is mostly cosmetic.

Then branching can be expensive, i.e. I'd replace if (R_IsNA(x[i])) c++; by c += NumericVector::is_na(x[i]). This gives this version:

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
  return c;
}

Then iterating on an int and accessing x[i] can be replaced by using the std::count_if algorithm. This is it's raison d'être. Leading to this version:

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
  return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}

Now if the performance is still not good enough, you might want to try parallelization, for this I typically use the tbb library from the RcppParallel package.

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
  return tbb::parallel_reduce( 
    tbb::blocked_range<const double*>(x.begin(), x.end()),
    0, 
    [](const tbb::blocked_range<const double*>& r, int init) -> int {
      return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
    }, 
    []( int x, int y){ return x+y; }
  ) ;
}

Benchmarking with this function:

library(microbenchmark)

bench <- function(n){
  x <- rep(c(1, 2, NA), n)
  microbenchmark(
    nb_na = nb_na(x), 
    nb_na4 = nb_na4(x), 
    nb_na5 = nb_na5(x), 
    nb_na6 = nb_na6(x)
  )
}
bench(1e5)

On my machine I get:

> bench(1e4)
Unit: microseconds
expr    min      lq      mean  median       uq     max neval  cld
nb_na  84.358 94.6500 107.41957 110.482 118.9580 137.393   100    d
nb_na4 59.984 69.4925  79.42195  82.442  85.9175 106.567   100  b  
nb_na5 65.047 75.2625  85.17134  87.501  93.0315 116.993   100   c 
nb_na6 39.205 51.0785  59.20582  54.457  68.9625  97.225   100 a   

> bench(1e5)
Unit: microseconds
expr     min       lq     mean   median       uq      max neval  cld
nb_na  730.416 732.2660 829.8440 797.4350 872.3335 1410.467   100    d
nb_na4 520.800 521.6215 598.8783 562.7200 657.1755 1059.991   100  b  
nb_na5 578.527 579.3805 664.8795 626.5530 710.5925 1166.365   100   c 
nb_na6 294.486 345.2050 368.6664 353.6945 372.6205  897.552   100 a

Another way is to totally circumvent floating point arithmetic and pretend the vector is a vector of long long, aka 64 bit integers and compare the values to the bit pattern of NA_REAL:

  > devtools::install_github( "ThinkR-open/seven31" )
  > seven31::reveal(NA, NaN, +Inf, -Inf )
  0 11111111111 ( NaN ) 0000000000000000000000000000000000000000011110100010 : NA
  0 11111111111 ( NaN ) 1000000000000000000000000000000000000000000000000000 : NaN
  0 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : +Inf
  1 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : -Inf

A serial solution using this hack:

// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  return std::count(p, p + x.size(), na ) ;

}

And then a parallel version:

// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
    return init + std::count( r.begin(), r.end(), na);
  } ;

  return tbb::parallel_reduce( 
    tbb::blocked_range<const long long*>(p, p + x.size()),
    0, 
    count_chunk, 
    []( int x, int y){ return x+y; }
  ) ;

}

  > bench(1e5)
  Unit: microseconds
     expr     min       lq     mean   median       uq      max neval    cld
    nb_na 730.346 762.5720 839.9479 857.5865 881.8635 1045.048   100      f
   nb_na4 520.946 521.6850 589.0911 578.2825 653.4950  832.449   100    d  
   nb_na5 578.621 579.3245 640.9772 616.8645 701.8125  890.736   100     e 
   nb_na6 291.115 307.4300 340.1626 344.7955 360.7030  484.261   100   c   
   nb_na7 122.156 123.4990 141.1954 132.6385 149.7895  253.988   100  b    
   nb_na8  69.356  86.9980 109.6427 115.2865 126.2775  182.184   100 a     

  > bench(1e6)
  Unit: microseconds
     expr      min        lq      mean    median        uq      max neval  cld
    nb_na 7342.984 7956.3375 10261.583 9227.7450 10869.605 79757.09   100    d
   nb_na4 5286.970 5721.9150  7659.009 6660.2390  9234.646 31141.47   100   c 
   nb_na5 5840.946 6272.7050  7307.055 6883.2430  8205.117 10420.48   100   c 
   nb_na6 2833.378 2895.7160  3891.745 3049.4160  4054.022 18242.26   100  b  
   nb_na7 1661.421 1791.1085  2708.992 1916.6055  2232.720 60827.63   100 ab  
   nb_na8  650.639  869.6685  1289.373  939.0045  1291.025 10223.29   100 a

This assumes there's only one bit pattern to represent NA.

Here's my entire file for reference:

#include <Rcpp.h>
#include <RcppParallel.h>

// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
  return c;
}

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
  int n = x.size();
  int c = 0;
  for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
  return c;
}

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
  return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
  return tbb::parallel_reduce( 
    tbb::blocked_range<const double*>(x.begin(), x.end()),
    0, 
    [](const tbb::blocked_range<const double*>& r, int init) -> int {
      return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
    }, 
    []( int x, int y){ return x+y; }
  ) ;
}

// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  return std::count(p, p + x.size(), na ) ;

}

// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
  const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
  long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

  auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
    return init + std::count( r.begin(), r.end(), na);
  } ;

  return tbb::parallel_reduce( 
    tbb::blocked_range<const long long*>(p, p + x.size()),
    0, 
    count_chunk, 
    []( int x, int y){ return x+y; }
  ) ;

}

/*** R
library(microbenchmark)

bench <- function(n){
  x <- rep(c(1, 2, NA), n)
  microbenchmark(
    nb_na = nb_na(x), 
    nb_na4 = nb_na4(x), 
    nb_na5 = nb_na5(x), 
    nb_na6 = nb_na6(x), 
    nb_na7 = nb_na7(x), 
    nb_na8 = nb_na8(x)
  )
}
bench(1e5)
bench(1e6)
*/

Thanks for this answer. I tested solutions 4 & 5. On my computer, they are ~1.6 times slower (for both `n = 1e4` and `n = 1e5`) — F. Privé, Oct 24 '17 at 12:12
Maybe this is about optimisation setting, I have this in my `~/.R/Makevars: ` `CXXFLAGS=-O3` and `CXX11FLAGS=-O3` — Romain Francois, Oct 24 '17 at 12:14
I've once tested the `-03` optimization on one of my algo. It resulted in a slower function. So I prefer not to mess with R defaults for compiler options. — F. Privé, Oct 24 '17 at 12:17
Sorry, but I still don't understand. An `NA` coming from R and coerced in a double in Rcpp will always have the same bit encoding, right? So why can't we simply test that all the bits are equal (without having to coerce to an int64)? There is no function for comparing bits in C++? — F. Privé, Nov 07 '17 at 15:29
Maybe you want this. http://www.cplusplus.com/reference/cstring/memcmp/ This is what the Rcpp solution for checking for NA is based on. Maybe this will help you: https://en.wikipedia.org/wiki/NaN — Romain Francois, Nov 07 '17 at 15:37
Yes. I've tested it. Indeed, it's not fast. Don't understand why though. — F. Privé, Nov 07 '17 at 16:40
Careful: `na7` and `na8` violate C++ strict aliasing rules and are therefore invalid C++. I don’t know whether they currently break on any compiler but I would be extremely wary. This can be avoided by instead using `std::copy` (or `std::memcpy`) on bytes to convert individual memory locations, and [the copy can be elided](https://stackoverflow.com/a/2959055/1968), so that workaround might be just as efficient. — Konrad Rudolph, Nov 08 '17 at 16:35

HenrikB · Answer 2 · 2017-10-23T21:12:59.433

4

Checking for (IEEE) missing floating-point values is an expensive operating and there is no way around it. This is unrelated to R.

This is one reason why we're excited about the upcoming ALTREP in R - there we can for instance keep track of whether a double/real vector contains missing values or not - if it doesn't, then we don't have to waste time looking for them. Although not updated to mention ALTREP, you can get the gist from https://github.com/HenrikBengtsson/Wishlist-for-R/issues/12

edited Oct 23 '17 at 21:12

answered Oct 23 '17 at 15:08

HenrikB

6,132
31
34

ALTREP is promising. Yet, I don't want to check if there is some missing value or not at all. Basically, you can consider that there will always be multiple missing values randomly spread over the vector (or matrix). – F. Privé Oct 23 '17 at 15:31
Regarding ALTREP: Theoretically, I can imagine that an ALTREP implementation could keep an NA counter (rather than has/has-not), but not sure if that would be a common-enough need for it to go into R core (even though you probably could implement it yourself). Disclaimer: I'm commenting without having gone into the ALTREP details enough. – HenrikB Oct 23 '17 at 18:10
I see two problems: what will be the count when you subset? Moreover, I'm using vectors and couting NAs, just an example for this question. – F. Privé Oct 24 '17 at 06:15
ALTREP: You could imagine that subsetting either sets the NA counter to "unknown" (forcing a "lazy" update when/if counter is needed) or recalculate directly before returning (more costly). I _guess_ that's the idea how ALTREP can be used. – HenrikB Oct 26 '17 at 03:17

Fast checking of missing values in Rcpp

2 Answers2