I am using the R bigmemory package and Rcpp to handle big matrices (1 to 10 million columns x 1000 rows). Once I have read an integer matrix consisting of 0, 2 and NA into a file-backed bigmemory matrix in R, I would like to modify all the NA values through C++, in order to impute either the per-column mean or an arbitrary value (I show the latter here).

Below is the Rcpp function I have written, which does not work. My hope was that calling BigNA(mybigmatrix@address) from within R would find the elements in the matrix that are NA and modify their values directly in the backing file.

I think the problem might be in the evaluation of std::isnan(mat[j][i]). I checked this by writing an alternative function that counts the NA values with an accumulator, and indeed it did not count any. But once this is solved, I am also not sure whether the expression mat[j][i] = 1 would modify the value in the backing file. Writing those statements feels intuitive to me, coming from an R background, but they might be wrong.

Any help/suggestion would be very much appreciated.

#include <stdio.h>
#include <Rcpp.h>
#include <bigmemory/MatrixAccessor.hpp>
#include <numeric>
// [[Rcpp::depends(BH, bigmemory)]]
// [[Rcpp::depends(Rcpp)]]


// [[Rcpp::export]]
void BigNA(SEXP pBigMat) {
  /*
  * Imputation of "NA" values for "1" in a big 0, 2 NA matrix.
  */

  // Create the external BigMatrix pointer and initialize the matrix accessor
  XPtr<BigMatrix> xpMat(pBigMat);
  MatrixAccessor<int> mat = (*xpMat);

  // Iterate over the elements of the matrix and, when an NA is found, substitute "1"
  for(int i=0; i< xpMat->ncol(); i++){
    for(int j=0; j< xpMat->nrow(); j++){
      if(std::isnan(mat[j][i])){ 
        mat[j][i] = 1;
      }
    }
  }
} 

2 Answers

The problem stems from the difference between NA in R and NaN in C++.

MatrixAccessor<int> gives you an accessor for values of type int. Any number in R can be NA, but an int in C++ is never NaN. An optimizing compiler could remove the std::isnan(x) check entirely when x is of type int, as in your case.

To fix this, you could either:

  • Use MatrixAccessor<float> (or double). This implies actually storing a different data type.
  • Check what value you're actually getting for NA elements. I think you will find it is INT_MIN in C++ (-2147483648). Replace isnan(x) with x == INT_MIN.
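A minimal, standalone sketch of the second option, assuming (as suggested above) that integer NA is stored as INT_MIN. It uses a plain std::vector<int> as a stand-in for the matrix data rather than the bigmemory API, and the names kNaInt, impute_na, count_isnan and demo are hypothetical:

```cpp
#include <cassert>
#include <climits>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumption: integer NA is stored as the sentinel INT_MIN.
const int kNaInt = INT_MIN;

// Replace every NA sentinel with `value`; returns how many were replaced.
std::size_t impute_na(std::vector<int>& data, int value) {
    std::size_t replaced = 0;
    for (int& x : data) {
        if (x == kNaInt) {
            x = value;
            ++replaced;
        }
    }
    return replaced;
}

// Mirrors the original check: an int converted to double is never NaN,
// so this always returns 0.
std::size_t count_isnan(const std::vector<int>& data) {
    std::size_t found = 0;
    for (int x : data) {
        if (std::isnan(static_cast<double>(x))) {
            ++found;
        }
    }
    return found;
}

// Small self-check on one column holding 0, 2 and two NA sentinels.
void demo() {
    std::vector<int> col = {0, 2, kNaInt, 2, kNaInt};
    assert(count_isnan(col) == 0);   // isnan finds nothing
    assert(impute_na(col, 1) == 2);  // sentinel comparison finds both
}
```

In the real function, the same comparison would replace std::isnan(mat[j][i]) inside the loops; the sketch only illustrates why one test fires and the other never does.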

Related: Extracting a column with NA's from a bigmemory object in Rcpp

John Zwinck

The bigmemory package provides functions to check for NA values.

Just add the header with #include <bigmemory/isna.hpp> and replace std::isnan(mat[j][i]) with isna(mat[j][i]).

F. Privé
  • I will implement your suggestion and benchmark it with John's suggestion. Thanks. – Moisés Expósito Alonso Nov 13 '17 at 22:20
  • @MoisésExpósitoAlonso: I'd like to see the results of that benchmark. – John Zwinck Nov 14 '17 at 01:47
  • There should be no difference between the two implementations. The only difference is that it has already been implemented by the authors of **bigmemory** (for `char`, `short`, `integer` and `double`). And one should not reinvent the wheel. – F. Privé Nov 14 '17 at 06:41