
I have a function that computes basic summary statistics from the rows (or columns) of a given matrix, and I am now trying to use this function with a bigstatsr::FBM as well (I am aware that using columns should be more efficient). The reason I want to store the rows/columns in a vector is that I would like to compute quantiles with std::nth_element. If there is a different way to do that without the vector, I would be equally happy.

This is the code I use for a regular matrix.

// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

using namespace Rcpp;

// [[Rcpp::export]]
Eigen::MatrixXd summaryC(Eigen::MatrixXd x, int nrow) {
  Eigen::MatrixXd result(nrow, 5);
  // 0-based positions of min, 25%, 50%, 75%, max for 1000 columns;
  // the leading -1 marks the start of the still-unsorted range
  int indices[6] = {-1, 0, 249, 500, 750, 999};

  for (int i = 0; i < nrow; i++) {
    Eigen::VectorXd v = x.row(i);
    for (int q = 0; q < 5; ++q) {
      // partially sort only past the previous pivot; afterwards
      // v[indices[q+1]] holds the value a full sort would put there
      std::nth_element(v.data() + indices[q] + 1,
                       v.data() + indices[q + 1],
                       v.data() + v.size());
      result(i, q) = v[indices[q + 1]];
    }
  }
  return result;
}

/*** R 
x <- matrix(as.numeric(1:1000000), ncol = 1000)
summaryC(x = x, nrow = 1000)
***/
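
As an aside on how the inner loop works: after std::nth_element(first, nth, last), every element in [first, nth) is less than or equal to *nth, so each later call only needs to touch the tail past the previous pivot. A minimal standalone illustration of that invariant (values and positions made up for the example):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
  std::vector<double> v = {9, 1, 8, 2, 7, 3, 6, 4, 5, 0};

  // 0-based positions of min, median, max for a length-10 vector
  int pos[3] = {0, 4, 9};

  int start = 0;
  for (int k : pos) {
    // only the still-unsorted tail [start, end) is rearranged;
    // afterwards v[k] holds the value a full sort would put there
    std::nth_element(v.begin() + start, v.begin() + k, v.end());
    std::cout << "v[" << k << "] = " << v[k] << "\n";  // prints 0, 4, 9
    start = k + 1;
  }
}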

However, I struggle to do this with an FBM, as I have not fully grasped the intricacies of how the FBM pointer works.
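
For reference, single-element access is clear to me; a minimal sketch like this works (the toy function firstElement is just for illustration), reading the external pointer stored in the FBM object and using the element accessor:

// [[Rcpp::depends(BH, bigstatsr)]]
#include <bigstatsr/BMAcc.h>
using namespace Rcpp;

// [[Rcpp::export]]
double firstElement(Environment fbm) {
  XPtr<FBM> xpMat = fbm["address"]; // external pointer to the backing data
  BMAcc<double> macc(xpMat);        // accessor: macc(i, j) reads element (i, j)
  return macc(0, 0);
}

What I cannot figure out is how to copy a whole row or column into an Eigen::VectorXd.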

I tried the following without success:

// [[Rcpp::depends(BH, bigstatsr, RcppEigen)]]
// [[Rcpp::plugins(cpp11)]]
#include <bigstatsr/BMAcc.h>
#include <RcppEigen.h>

using namespace Rcpp;

// [[Rcpp::export]]
Eigen::MatrixXd summaryCbig(Environment fbm, int nrow, Eigen::VectorXi ind_col) {

  Eigen::MatrixXd result(nrow, 5);

  XPtr<FBM> xpMat = fbm["address"];
  BMAcc<double> macc(xpMat);

  int indices[6] = {-1, 0, 249, 500, 750, 999};

  for (int i = 0; i < nrow; i++) {

    Eigen::VectorXd v = macc.row(i); // this does not work
    Eigen::VectorXd v = macc(i, _);  // this does not work
    SubBMAcc<double> maccr(xpMat, i, ind_col - 1); // this did not work with Eigen::VectorXi, but works with const NumericVector&
    Eigen::VectorXd v = maccr;       // this does not work even for an appropriate ind_col

    for (int q = 0; q < 5; ++q) {
      std::nth_element(v.data() + indices[q] + 1,
                       v.data() + indices[q + 1],
                       v.data() + v.size());
      result(i, q) = v[indices[q + 1]];
    }
  }
  return result;
}
/*** R
library(bigstatsr)
X <- FBM(1000, 1000, init = 1:1e6)
summaryCbig(fbm = X, nrow = 1000, ind_col = 1:1000)
*/

Any help would be greatly appreciated, thank you!

Update: the big_apply approach

I implemented the approach with two differently sized matrices X1 and X2. Code for X1 (the X2 benchmark is analogous):

library(bigstatsr)
X1 <- FBM(1000, 1000, init = 1e6)
X2 <- FBM(10000, 10000, init = 9999)
microbenchmark::microbenchmark(
  big_apply(X1, a.FUN = function(X, ind) {
    matrixStats::rowQuantiles(X[ind, ])
  }, a.combine = "rbind", ind = rows_along(X1), ncores = nb_cores(), block.size = 500),

  big_apply(X1, a.FUN = function(X, ind) {
    matrixStats::rowQuantiles(X[ind, ])
  }, a.combine = "rbind", ind = rows_along(X1), ncores = 1, block.size = 500),

  times = 5
)

When using X1 and block.size = 500, running on 4 cores instead of 1 makes the task 5-10 times slower on my PC (4 CPUs and Windows, unfortunately). Using the bigger matrix X2 and leaving block.size at its default, the 4-core version takes 10 times longer than the non-parallelized one.

Result for X2:

 expr       min       lq      mean    median        uq       max neval
  PAR 16.149055 19.13568 19.369975 20.139363 20.474103 20.951676     5
  SEQ  1.297259  2.67385  2.584647  2.858035  2.867537  3.226552     5

(PAR = ncores = nb_cores(), SEQ = ncores = 1, in the order of the calls above.)
  • What is the goal of the inner loop with `std::nth_element`? – F. Privé Aug 30 '18 at 13:27
  • I want to compute user-specified quantiles. The approach is borrowed from https://stackoverflow.com/questions/11964552/finding-quartiles (first answer). Essentially it is a way of sorting without having to sort everything (although full sorting would probably be an acceptable alternative). – Nikos Bosse Aug 30 '18 at 15:14
  • Is your matrix too big to fit in memory? Do you want to modify the original matrix or a copy? Do you want the quartiles for every row? – F. Privé Aug 30 '18 at 15:44
  • The input matrix is too big and I need the quantiles for every row of it. The output is not too big and should be a regular matrix. So I don't want to modify the input matrix, but it would not be a big problem if that happened (I would just save a copy of the input beforehand) – Nikos Bosse Aug 30 '18 at 16:02

1 Answer

Assuming you have

library(bigstatsr)
X <- FBM(1000, 1000, init = 1:1e6)

I would not reinvent the wheel and use:

big_apply(X, a.FUN = function(X, ind) {
  matrixStats::rowQuantiles(X[ind, ])
}, a.combine = "rbind", ind = rows_along(X), ncores = nb_cores(), block.size = 500)

Choose block.size (the number of rows per block) wisely. Function big_apply() is very useful if you want to apply an R(cpp) function to blocks of the FBM.

Edit: of course, parallelism will be slower for small matrices, because of the overhead of parallelism (usually 1-3 seconds). See the results for X1 and X2:

library(bigstatsr)
X1 <- FBM(1000, 1000, init = 1e6)
microbenchmark::microbenchmark(
  PAR = big_apply(X1, a.FUN = function(X, ind) {
    matrixStats::rowQuantiles(X[ind, ])
  }, a.combine = "rbind", ind = rows_along(X1), ncores = nb_cores(), block.size = 500),

  SEQ = big_apply(X1, a.FUN = function(X, ind) {
    matrixStats::rowQuantiles(X[ind, ])
  }, a.combine = "rbind", ind = rows_along(X1), ncores = 1, block.size = 500),

  times = 5
)

Unit: milliseconds
 expr        min        lq       mean    median         uq        max neval cld
  PAR 1564.20591 1602.0465 1637.77552 1629.9803 1651.04509 1741.59974     5   b
  SEQ   68.92936   69.1002   76.70196   72.9173   85.31751   87.24543     5  a 

X2 <- FBM(10000, 10000, init = 9999)
microbenchmark::microbenchmark(
  PAR = big_apply(X2, a.FUN = function(X, ind) {
    matrixStats::rowQuantiles(X[ind, ])
  }, a.combine = "rbind", ind = rows_along(X2), ncores = nb_cores(), block.size = 500),

  SEQ = big_apply(X2, a.FUN = function(X, ind) {
    matrixStats::rowQuantiles(X[ind, ])
  }, a.combine = "rbind", ind = rows_along(X2), ncores = 1, block.size = 500),

  times = 5
)

Unit: seconds
 expr       min        lq      mean    median        uq       max neval cld
  PAR  4.757409  4.958869  5.071982  5.083381  5.218098  5.342153     5  a 
  SEQ 10.842828 10.846281 11.177460 11.360162 11.416967 11.421065     5   b

The bigger your matrix is, the more you will gain from parallelism.

  • Thank you. I will try to implement and profile this later. For some reason I could not fully understand, I always experienced a very significant overhead when using foreach / big_apply (I had used nb_cores() and had left block.size unchanged). The reason I wanted to do this with Rcpp is that it fits nicely into the rest of my work and that I can use OpenMP for parallelisation. – Nikos Bosse Aug 30 '18 at 16:43
  • By default, `big_apply()` splits columns. This may explain why you had problems. – F. Privé Aug 30 '18 at 17:00
  • That was unfortunately not the issue, I had reprogrammed my function to work with columns instead – Nikos Bosse Aug 30 '18 at 17:13
  • What is the problem? Could you post another question or file an issue in the GitHub repo? Is this problem solved? – F. Privé Aug 30 '18 at 17:20
  • I added an edit to my original post that is hopefully helpful. Parallelisation does not really seem to work in this instance. If it is possible, I would therefore probably prefer to use Rcpp and OpenMP. – Nikos Bosse Aug 30 '18 at 18:02
  • Very interesting. Using block.size = 500 made it faster compared to leaving block.size at the default (i.e. not defining it). How can I determine the optimal choice of block.size? – Nikos Bosse Aug 30 '18 at 19:13
  • Is there also a way to access the rows from within Rcpp or is that something rather impossible? – Nikos Bosse Aug 30 '18 at 19:15
  • Use `block_size(ncol(X), ncores)` as a good default. The default value is `block_size(nrow(X), ncores)` because it splits columns by default. For your first approach, I could give you a way to access a specific row, but I think your approach was wrong in the sense that it manipulated pointers, not actual contiguous data. – F. Privé Aug 30 '18 at 19:29
  • Thank you very much for your help. If you also have a way to access rows or columns directly, I would very much appreciate it. I could then play around a little with it and get a deeper understanding. – Nikos Bosse Aug 30 '18 at 19:53
  • It's not a good idea. If you want to access the j-th element of row i, just use `(i, j)`; there is no need to wrap around that, don't you think? It would trick you into thinking you're accessing contiguous data. – F. Privé Aug 30 '18 at 19:58
  • But if I accessed them columnwise, wouldn't I then access contiguous data? I would still, at least out of curiosity, be interested in how the overhead of OpenMP compares to the implementation with big_apply. – Nikos Bosse Aug 30 '18 at 20:00
  • If you access columns, you would get contiguous data. If you want a pointer, you could use `&macc(0, j)` to `&macc(0, j+1)` for a whole column (see the sketch after these comments). OpenMP is a different kind of parallelism. I think the overhead is minimal, but you would have to reimplement everything yourself if you want to use it. You could look at RcppParallel if you're really interested. – F. Privé Aug 30 '18 at 20:04
  • I have a parallelized version with OpenMP for the first example in my post that is already working. I was hoping that I could just copy the content of one row/column into an Eigen::Vector so that this could also work. Even if it doesn't, how exactly would I use &macc(0, j) to &macc(0, j+1)? Thank you! – Nikos Bosse Aug 30 '18 at 20:16
  • This is the same as `v.data()` and `v.data() + v.size()` in your code, for `v` ~ col(j). You should get more familiar with C++ before trying to parallelize your C++ code; this is not trivial. – F. Privé Aug 30 '18 at 20:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179135/discussion-between-nikos-bosse-and-f-prive). – Nikos Bosse Aug 30 '18 at 20:27
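
For completeness, a rough sketch of the column-wise pointer access discussed in these comments. This is an illustration, not code from the thread: the function name summaryCbigCol is made up, the quantile positions are generalized from the question's hardcoded indices (so they may differ by one from the 249/500/750 convention used there), and it assumes BMAcc<double> exposes nrow()/ncol() as in the bigstatsr headers:

// [[Rcpp::depends(BH, bigstatsr, RcppEigen)]]
// [[Rcpp::plugins(cpp11)]]
#include <bigstatsr/BMAcc.h>
#include <RcppEigen.h>
#include <algorithm>
#include <vector>

using namespace Rcpp;

// [[Rcpp::export]]
Eigen::MatrixXd summaryCbigCol(Environment fbm) {
  XPtr<FBM> xpMat = fbm["address"];
  BMAcc<double> macc(xpMat);

  int n = macc.nrow();  // elements within a column are contiguous
  int m = macc.ncol();
  Eigen::MatrixXd result(m, 5);
  std::vector<double> v(n);

  // 0-based positions of min / 25% / 50% / 75% / max for n rows;
  // the leading -1 marks the start of the still-unsorted range
  int idx[6] = {-1, 0, (n - 1) / 4, (n - 1) / 2, 3 * (n - 1) / 4, n - 1};

  for (int j = 0; j < m; j++) {
    // column j spans &macc(0, j) .. &macc(0, j) + n, so it can be
    // copied into a scratch buffer in one go; the FBM stays untouched
    std::copy(&macc(0, j), &macc(0, j) + n, v.begin());

    for (int q = 0; q < 5; ++q) {
      std::nth_element(v.begin() + idx[q] + 1, v.begin() + idx[q + 1], v.end());
      result(j, q) = v[idx[q + 1]];
    }
  }
  return result;
}

Copying each column into the scratch buffer keeps the original FBM unmodified, which matches the requirement stated in the comments on the question.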