Fastest way to multithread doing quickselect on all columns or all rows of a matrix in Rcpp - OpenMP, RcppParallel or RcppThread

Question

I was using this Rcpp code to do a quickselect on a vector of values, i.e. obtain the kth largest element from a vector in O(n) time (I saved this as qselect.cpp):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export]]
double qSelect(arma::vec& x, const int k) {

  // ARGUMENTS
  // x: vector to find k-th largest element in
  // k: desired k-th largest element

  // safety copy since nth_element modifies in place
  arma::vec y(x.memptr(), x.n_elem);

  // partially sort y in O(n) time
  std::nth_element(y.begin(), y.begin() + k - 1, y.end());

  // the k-th largest value
  const double kthValue = y(k-1);

  return kthValue;

}

I was using this as a fast way to calculate a desired percentile. E.g.

n = 50000
set.seed(1)
x = rnorm(n=n, mean=100, sd=20)
tau = 0.01 # desired percentile
k = tau*n+1 # here we will get the 6th largest element
library(Rcpp)
Rcpp::sourceCpp('qselect.cpp')
library(microbenchmark)
microbenchmark(qSelect(x,k)) # 53.32917, 548 µs
microbenchmark(sort(x, partial=k)[k]) # 53.32917, 694 µs = pure R solution

[This may look like it's already fast but I need to do this millions of time in my application]

Now I would like to modify this Rcpp function so that it would do a multithreaded quickselect on all columns or all rows of an R matrix, and return the result as a vector. As I am a bit of a novice in Rcpp I would like some advice though on which framework would likely be fastest for this & would be easiest to code (it would have to work easily cross-platform & I would need good control over the nr of threads to use). Using OpenMP, RcppParallel or RcppThread? Or even better - if someone could perhaps demonstrate a fast and elegant way to do this?

BTW the common way for `microbenchmark()` use is to toss both (competing) expressions into one call: `microbenchmark(qSelect(x,k), sort(x, partial=k)[k]))`. — Dirk Eddelbuettel, Feb 23 '19 at 21:50

score 4 · Answer 1 · answered Feb 23 '19 at 21:28

4

Yes, that would be a candidate for a multithreaded variant -- but as the RcppParallel documentation will tell you, one requirement for parallel code is non-R memory and here we use RcppArmadillo in our efficient zero copy way meaning it is R memory.

So you may need to trade off extra data copies (into, say, the RMatrix type RcppParallel uses) which parallel execution.

But because your algorithm is simple and column-wise you could also experiment with a single OpenMP loop in your function above: pass it a matrix, have it loop over columns using #pragma for.

answered Feb 23 '19 at 21:28

Dirk Eddelbuettel

360,940
56
644
725

So are you saying that OpenMP would be the only framework here where i could readily avoid making extra data copies? Is OpenMP already commonly used in CRAN packages actually? (I have the feeling not, and was a little afraid for Mac users, as I believe they sometimes have problems setting up OpenMP) – Tom Wenseleers Feb 23 '19 at 21:33
I'd never say anything as absolute as only the tidyverse has the benefit of knowing everything with certitude :) You probably want to benchmark and try different approaches. _From the top of my head_ it seemed like the copy (which would be the safe thing to do) may be costly enough to not make the parallel code be worth it. But then again ... Conrad also uses OpenMP inside of Armadillo and looping over columns of a (assumed fixed for the call) blob of data seems safe. So you _may_ get away not copying away from `SEXP` data. But you know about proofs and pudding ... – Dirk Eddelbuettel Feb 23 '19 at 21:49
Thanks for the advice! I made an edit in my question with now some OpenMP code included - this was my first attempt and I'm really no expert - so if you see anything that could be done better let me know! But I got about a 2x speed increase by doing an apply in Rcpp and a further 2x speed increase using OpenMP multithreading... – Tom Wenseleers Feb 24 '19 at 00:44
There is still a bit of copying going on in my code though, and i think because of that the OpenMP version is not faster for small n. If you would happen to know how to avoid this copying let me know! – Tom Wenseleers Feb 24 '19 at 00:53

Tom Wenseleers · Accepted Answer · 2019-02-28T09:04:58.583

Following the advice below I tried multithreading with OpenMP and this seems to give decent speedups using 8 threads on my laptop. I modified my qselect.cpp file to:

// [[Rcpp::depends(RcppArmadillo)]]
#define RCPP_ARMADILLO_RETURN_COLVEC_AS_VECTOR
#include <RcppArmadillo.h>
using namespace arma;

// [[Rcpp::export]]
double qSelect(arma::vec& x, const int k) {

  // ARGUMENTS
  // x: vector to find k-th largest element in
  // k: k-th statistic to look up

  // safety copy since nth_element modifies in place
  arma::vec y(x.memptr(), x.n_elem);

  // partially sorts y
  std::nth_element(y.begin(), y.begin() + k - 1, y.end());

  // the k-th largest value
  const double kthValue = y(k-1);

  return kthValue;

}


// [[Rcpp::export]]
arma::vec qSelectMbycol(arma::mat& M, const int k) {

  // ARGUMENTS
  // M: matrix for which we want to find the k-th largest elements of each column
  // k: k-th statistic to look up

  arma::mat Y(M.memptr(), M.n_rows, M.n_cols);
  // we apply over columns
  int c = M.n_cols;
  arma::vec out(c);
  int i;
  for (i = 0; i < c; i++) {
      arma::vec y = Y.col(i);
      std::nth_element(y.begin(), y.begin() + k - 1, y.end());
      out[i] = y(k-1); // the k-th largest value of each column
  }

  return out;

}

#include <omp.h>
// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
arma::vec qSelectMbycolOpenMP(arma::mat& M, const int k, int nthreads) {

  // ARGUMENTS
  // M: matrix for which we want to find the k-th largest elements of each column
  // k: k-th statistic to look up
  // nthreads: nr of threads to use

  arma::mat Y(M.memptr(), M.n_rows, M.n_cols);
  // we apply over columns
  int c = M.n_cols;
  arma::vec out(c);
  int i;
  omp_set_num_threads(nthreads);
#pragma omp parallel for shared(out) schedule(dynamic,1)
  for (i = 0; i < c; i++) {
    arma::vec y = Y.col(i);
    std::nth_element(y.begin(), y.begin() + k - 1, y.end());
    out(i) = y(k-1); // the k-th largest value of each column
  }

  return out;

}

Benchmarks:

n = 50000
set.seed(1)
x = rnorm(n=n, mean=100, sd=20)
M = matrix(rnorm(n=n*10, mean=100, sd=20), ncol=10)
tau = 0.01 # desired percentile
k = tau*n+1 # we will get the 6th smallest element
library(Rcpp)
Rcpp::sourceCpp('qselect.cpp')
library(microbenchmark
microbenchmark(apply(M, 2, function (col) sort(col, partial=k)[k]),
               apply(M, 2, function (col) qSelect(col,k)),
               qSelectMbycol(M,k),
               qSelectMbycolOpenMP(M,k,nthreads=8))[,1:4]

Unit: milliseconds
                                                 expr      min       lq      mean    median        uq        max neval cld
 apply(M, 2, function(col) sort(col, partial = k)[k]) 8.937091 9.301237 11.802960 11.828665 12.718612  43.316107   100   b
           apply(M, 2, function(col) qSelect(col, k)) 6.757771 6.970743 11.047100  7.956696  9.994035 133.944735   100   b
                                  qSelectMbycol(M, k) 5.370893 5.526772  5.753861  5.641812  5.826985   7.124698   100  a 
              qSelectMbycolOpenMP(M, k, nthreads = 8) 2.695924 2.810108  3.005665  2.899701  3.061996   6.796260   100  a

I was surprised by the ca 2 fold gain in speed of doing the apply in Rcpp without even using multithreading (qSelectMbycol function) and there was a further 2 fold speed increase with OpenMP multithreading (qSelectMbycolOpenMP).

Any advice on possible code optimization welcome though...

For small n (n<1000) the OpenMP version is not faster, maybe because the individuals jobs are just too small then. E.g. for n=500:

Unit: microseconds
                                                 expr     min       lq      mean   median       uq      max neval cld
 apply(M, 2, function(col) sort(col, partial = k)[k]) 310.477 324.8025 357.47145 337.8465 361.5810 1782.885   100   c
           apply(M, 2, function(col) qSelect(col, k)) 103.921 114.8255 141.59221 119.3155 131.9315 1990.298   100  b 
                                  qSelectMbycol(M, k)  24.377  32.2885  44.13873  35.2825  39.3440  900.210   100 a  
              qSelectMbycolOpenMP(M, k, nthreads = 8)  76.123  92.1600 130.42627  99.8575 112.4730 1303.059   100  b

Fastest way to multithread doing quickselect on all columns or all rows of a matrix in Rcpp - OpenMP, RcppParallel or RcppThread

2 Answers2