optimize matrix multiplication in for loop RcppArmadillo

Question

The aim is to implement a fast version of the orthogonal projective non-negative matrix factorization (opnmf) in R. I am translating the matlab code available here.

I implemented a vanilla R version but it is much slower (about 5.5x slower) than the matlab implementation on my data (~ 225000 x 150) for 20 factor solution.

So I thought using c++ might speed up things but its speed is similar to R. I think this can be optimized but not sure how as I am a newbie to c++. Here is a thread that discusses a similar problem.

Here is my RcppArmadillo implementation.

// [[Rcpp::export]]
Rcpp::List arma_opnmf(const arma::mat & X, const arma::mat & W0, double tol=0.00001, int maxiter=10000, double eps=1e-16) {
  arma::mat W = W0;
  arma::mat Wold = W;
  arma::mat XXW = X * (X.t()*W);
  double diffW = 9999999999.9;
  
  Rcout << "The value of maxiter : " << maxiter << "\n";
  Rcout << "The value of tol : " << tol << "\n";
  
  int i;
  for (i = 0; i < maxiter; i++) {
    XXW = X * (X.t()*W);
    W = W % XXW / (W  * (W.t() * XXW));
    //W = W % (X*(X.t()*W)) / (W*((W.t()*X)*(X.t()*W)));
    
    arma::uvec idx = find(W < eps);
    W.elem(idx).fill(eps);
    W = W / norm(W,2);
    diffW = norm(Wold-W, "fro") / norm(Wold, "fro");
    if(diffW < tol) {
      break;
    } else {
      Wold = W;
    }
    
    if(i % 10 == 0) {
      Rcpp::checkUserInterrupt();
    }
    
  }
  return Rcpp::List::create(Rcpp::Named("W")=W,
                            Rcpp::Named("iter")=i,
                            Rcpp::Named("diffW")=diffW);
}

This suggested issue confirms that matlab is quite fast, so is there no hope when using R / c++?

The tests were made on Windows 10 and Ubuntu 16 with R version 4.0.0.

EDIT

After the interesting comments in the answer below. I am posting additional details. I ran tests on a Windows 10 machine with R 3.5.3 (as that's what Microsoft provides) and the comparison shows that RcppArmadillo with Microsoft's R is fastest.

R

   user  system elapsed 
 213.76    7.36  221.42

R with RcppArmadillo

   user  system elapsed 
 179.88    3.44  183.43

Microsoft's Open R

   user  system elapsed 
 167.33    9.96   45.94

Microsoft's Open with RcppArmadillo

    user  system elapsed 
  85.47    4.66   23.56

Yes -- "Microsoft R" ships with the MKL which is why my answer (written when I did not yet know what OS you are using) pointed to ways of instaling the MKL. The nice thing about R and Armadillo and of course then also RcppArmadillo is that they use this "automatically" once you have it installed. And "Microsoft R" is still an easy way to get the MKL, but these days it is no longer the only way as Intel now uses a more liberal license for the MKL which is a good thing. — Dirk Eddelbuettel, Jun 20 '20 at 14:38
Also notice the "scale" between elapsed and user. In the first case you have no parallelism at all from base R and Armadillo. Once you have the MKL, elapsed time is (very roughly) a quarter to (total) user time, That is your gain from multicore. — Dirk Eddelbuettel, Jun 20 '20 at 14:42

Dirk Eddelbuettel · Accepted Answer · 2020-06-20T15:09:22.463

Are you aware that this code is "ultimately" executed by a pair of libraries called LAPACK and BLAS?

Are you aware that Matlab ships with a highly optimised one? Are you aware that on all systems that R runs on you can change which LAPACK/BLAS is being used.

The difference matters greatly. Just this morning a friend posted this tweet contrasting the same R code running on the same Windows computer but in two different R environments. The six-times faster one "simply" uses a parallel LAPACK/BLAS implementation.

Here, you haven't even told us which operating system you are on. You can get OpenBLAS (which uses parallelism) for all OSs that R runs on. You can even get the Intel MKL (which IIRC is what Matlab uses too) fairly easily on some OSs. For Ubuntu/Debian I published a script on GitHub that does it in one step.

Lastly, many years ago I "inherited" a fast program running in Matlab on a (then-large-ish) Windows computer. I rewrote the Matlab part (carefully and slowly, it's effort) in C++ using RcppArmadillo leading a few factors of improvement -- and because we could run that (now open source) code in parallel from R on the same computer another few factors. Together it was orders of magnitude turning a day-long simulation into something that ran a few minutes. So "yes, you can".

Edit: As you have access to Ubuntu, you can switch from basic LAPACK/BLAS to OpenBLAS via a single command, though I am no longer that familiar with Ubuntu 16.04 (as I run 20.04 myself).

Edit 2: Picking up the comparison from Josef's tweet, the Docker r-base container I also maintainer (as part of the Rocker Project) can use OpenBLAS. [1] So once we add it, e.g. via apt-get install libopenblas-dev the timing of a simple repeated matrix crossproduct moves from

root@0eb44b1fcc06:/# Rscript -e 'v <- matrix(1:1e6,1e3); system.time(replicate(10, crossprod(v,v)))'
   user  system elapsed 
  9.289   0.084   9.373 
root@0eb44b1fcc06:/#

to

root@67bd334f53d4:/# Rscript -e 'v <- matrix(1:1e6,1e3); system.time(replicate(10, crossprod(v,v)))'
   user  system elapsed 
  2.259   2.370   0.447 
root@67bd334f53d4:/#

which is substantial.

Thanks for your comments. This was tested on Windows 10 and Ubuntu 16 both with R4.0.0 and latest versions of Rcpp/RcppArmadillo. Will add this information to the question. But this still basically means that the "out-of-box" matlab is still faster than R/c++ for a naive user like me. — Krrr, Jun 20 '20 at 13:44
"Sure". But that is an odd way of phrasing it. "This car has a manual gear box so I will only drive in first gear." Sure, you could. But that is why I pointed you to Josef's tweet: There are _zero-effort_ ways to get better LAPACK/BLAS even when you are on Windows. On Debian and Ubuntu it is a one-line command. — Dirk Eddelbuettel, Jun 20 '20 at 13:47
Just tested the vanilla R code with Microsoft's R (also mentioned in the tweet response) which uses MKL. It tool 51sec (elapsed) compared to R taking 224sec. This is a substantial improvement and if this (using MKL) becomes default behavior of R then one doesn't need to think of/program additional gears. — Krrr, Jun 20 '20 at 14:06
Yes I have accepted the answer but I do agree with @gph that the response was not optimal and does not address the actual question completely. — Krrr, Jun 20 '20 at 14:17
You can always rephrase your question. Allow me to point out, again, that you didn;t even state the OS you were running on at first... Details matter. — Dirk Eddelbuettel, Jun 20 '20 at 14:18
of course details matter and even though the first question was not "optimal/complete" we do have comments for clarifications and I did provide details as soon as they were asked for. — Krrr, Jun 20 '20 at 14:19

optimize matrix multiplication in for loop RcppArmadillo

1 Answers1