
In R, matrix multiplication is highly optimized: it is really just a call to BLAS/LAPACK. However, I'm surprised that this very naive C++ code for matrix-vector multiplication seems reliably 30% faster.

 library(Rcpp)

 # Simple C++ code for matrix multiplication
 mm_code = 
 "NumericVector my_mm(NumericMatrix m, NumericVector v){
   int nRow = m.rows();
   int nCol = m.cols();
   NumericVector ans(nRow);
   double v_j;
   for(int j = 0; j < nCol; j++){
     v_j = v[j];
     for(int i = 0; i < nRow; i++){
       ans[i] += m(i,j) * v_j;
     }
   }
   return(ans);
 }
 "
 # Compiling
 my_mm = cppFunction(code = mm_code)

 # Simulating data to use
 nRow = 10^4
 nCol = 10^4

 m = matrix(rnorm(nRow * nCol), nrow = nRow)
 v = rnorm(nCol)

 system.time(my_ans <- my_mm(m, v))
#>    user  system elapsed 
#>   0.103   0.001   0.103 
 system.time(r_ans <- m %*% v)
#>   user  system elapsed 
#>  0.154   0.001   0.154 

 # Double checking answer is correct
 max(abs(my_ans - r_ans))
 #> [1] 0

Does base R's %*% perform some type of data check that I'm skipping over?

EDIT:

After understanding what's going on (thanks SO!), it's worth noting that this is a worst-case scenario for R's %*%, i.e. matrix by vector. For example, @RalfStubner pointed out that an RcppArmadillo implementation of matrix-vector multiplication is even faster than the naive implementation I demonstrated (and therefore considerably faster than base R), yet it is virtually identical to base R's %*% for matrix-matrix multiplication (when both matrices are large and square):

 arma_code <- 
   "arma::mat arma_mm(const arma::mat& m, const arma::mat& m2) {
 return m * m2;
 };"
 arma_mm = cppFunction(code = arma_code, depends = "RcppArmadillo")

 nRow = 10^3 
 nCol = 10^3

 mat1 = matrix(rnorm(nRow * nCol), 
               nrow = nRow)
 mat2 = matrix(rnorm(nRow * nCol), 
               nrow = nRow)

 system.time(arma_mm(mat1, mat2))
#>   user  system elapsed 
#>   0.798   0.008   0.814 
 system.time(mat1 %*% mat2)
#>   user  system elapsed 
#>   0.807   0.005   0.822  

So R's current (v3.5.0) %*% is near optimal for matrix-matrix, but could be significantly sped up for matrix-vector if you're okay with skipping the checks.

Cliff AB
  • It might not account for all of it, but R's method has to handle NA values. Also, based on the very little I know about numerical methods in computing, it's likely that your naive method ends up being unacceptably inaccurate in some circumstances, and so other methods will trade some speed for better accuracy. – joran Jun 27 '18 at 04:03
  • Looking at: `getAnywhere(`%*%`)`, we have: `function (x, y) .Primitive("%*%")`. So, this is interfacing with a _C_ library but as @joran points out, you are not factoring in `NA` handling. – coatless Jun 27 '18 at 04:05
  • @joran: as far as I can tell, this handles `NA`'s properly. The only difference I can see is that this results in a vector, not a matrix. – Cliff AB Jun 27 '18 at 04:06
  • Possibly there are matrix x matrix optimizations that are not particularly helpful when one of the dimensions is 1? – Cliff AB Jun 27 '18 at 04:13
  • This [post](https://radfordneal.wordpress.com/2011/05/21/slowing-down-matrix-multiplication-in-r/) is old, and Radford may have successfully gotten some improvements into R since he wrote it, but I think it at least summarizes that handling NA, Inf and NaN isn't always straightforward and does require some work. – joran Jun 27 '18 at 04:38
  • You can get huge improvements by using linear algebra libraries for matrix-matrix multiplications because they handle memory and cache better. For matrix-vector multiplications, memory problems are less of an issue, so the gain from optimization is smaller. See for example [this](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=15&cad=rja&uact=8&ved=0ahUKEwiF6Zu-1PPbAhUG1hQKHTiKB5o4ChAWCEYwBA&url=http%3A%2F%2Fwww.sdsc.edu%2F~allans%2Fcs260%2Flectures%2Fmatmul.ppt&usg=AOvVaw0mxxrJTE-B3kEeoX3IdI7-). – F. Privé Jun 27 '18 at 10:52

3 Answers


A quick glance at names.c (here in particular) points you to do_matprod, the C function that is called by %*% and which is found in the file array.c. (Interestingly, it turns out that both crossprod and tcrossprod dispatch to that same function as well.) Here is a link to the code of do_matprod.

Scrolling through the function, you can see that it takes care of a number of things your naive implementation does not, including:

  1. Keeps row and column names, where that makes sense.
  2. Allows for dispatch to alternative S4 methods when the two objects being operated on by a call to %*% are of classes for which such methods have been provided. (That's what's happening in this portion of the function.)
  3. Handles both real and complex matrices.
  4. Implements a series of rules for how to handle multiplication of a matrix and a matrix, a vector and a matrix, a matrix and a vector, and a vector and a vector. (Recall that under cross-multiplication in R, a vector on the LHS is treated as a row vector, whereas on the RHS it is treated as a column vector; this is the code that makes that so. The short sketch after this list illustrates those rules.)
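
For example, a quick base-R sketch of those rules (the object names here are just for illustration; the point is the resulting dimensions and dimnames, not the timing):

m_small <- matrix(1:6, nrow = 2,
                  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
v3 <- 1:3
v2 <- 1:2

m_small %*% v3   # RHS vector acts as a column vector: 2 x 1 result, row names kept
v2 %*% m_small   # LHS vector acts as a row vector: 1 x 3 result, column names kept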

Near the end of the function, it dispatches to either matprod or cmatprod. Interestingly (to me at least), in the case of real matrices, if either matrix might contain NaN or Inf values, then matprod dispatches (here) to a function called simple_matprod which is about as simple and straightforward as your own. Otherwise, it dispatches to one of a couple of BLAS Fortran routines which are presumably faster, if uniformly 'well-behaved' matrix elements can be guaranteed.
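
You can see that check in action with a rough sketch like the one below, reusing the m and v from the question (with the default matprod setting; the exact timings will depend on your R version and the BLAS you link against, so take this as illustrative only):

m_nan <- m
m_nan[1, 1] <- NaN          # a single non-finite value forces the simple, non-BLAS path
system.time(m %*% v)        # scan finds no NaN/Inf, so the BLAS routine is used
system.time(m_nan %*% v)    # scan finds the NaN, so simple_matprod is used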

Josh O'Brien
  • Interesting (+1). If these are the only differences, one thing that implies is that *if* I know that I'm doing vanilla matrix x vector operations, I *should* use `my_mm`. That comes as a surprise to me. – Cliff AB Jun 27 '18 at 04:46
  • @CliffAB You can probably gain even more by using the appropriate BLAS function either directly or indirectly via RcppArmadillo and using a multithreaded BLAS. – Ralf Stubner Jun 27 '18 at 05:00
  • Do you have any ideas why identical matrix multiplication in R has varying speed depending on the [order of matrices?](https://stackoverflow.com/questions/74061447/faster-evaluation-of-matrix-multiplication-from-right-to-left) – dcsuka Oct 22 '22 at 03:20

Josh's answer explains why R's matrix multiplication is not as fast as this naive approach. I was curious to see how much one could gain using RcppArmadillo. The code is simple enough:

arma_code <- 
  "arma::vec arma_mm(const arma::mat& m, const arma::vec& v) {
       return m * v;
   };"
arma_mm = cppFunction(code = arma_code, depends = "RcppArmadillo")

Benchmark:

> microbenchmark::microbenchmark(my_mm(m,v), m %*% v, arma_mm(m,v), times = 10)
Unit: milliseconds
          expr      min       lq      mean    median        uq       max neval
   my_mm(m, v) 71.23347 75.22364  90.13766  96.88279  98.07348  98.50182    10
       m %*% v 92.86398 95.58153 106.00601 111.61335 113.66167 116.09751    10
 arma_mm(m, v) 41.13348 41.42314  41.89311  41.81979  42.39311  42.78396    10

So RcppArmadillo gives us nicer syntax and better performance.

Curiosity got the better of me. Here is a solution using BLAS directly:

blas_code = "
NumericVector blas_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  char trans = 'N';               // no transpose: compute m %*% v directly
  double one = 1.0, zero = 0.0;   // dgemv computes ans = one * m * v + zero * ans
  int ione = 1;                   // unit stride for both vectors
  F77_CALL(dgemv)(&trans, &nRow, &nCol, &one, m.begin(), &nRow, v.begin(),
           &ione, &zero, ans.begin(), &ione);
  return ans;
}"
blas_mm <- cppFunction(code = blas_code, includes = "#include <R_ext/BLAS.h>")

Benchmark:

Unit: milliseconds
          expr      min       lq      mean    median        uq       max neval
   my_mm(m, v) 72.61298 75.40050  89.75529  96.04413  96.59283  98.29938    10
       m %*% v 95.08793 98.53650 109.52715 111.93729 112.89662 128.69572    10
 arma_mm(m, v) 41.06718 41.70331  42.62366  42.47320  43.22625  45.19704    10
 blas_mm(m, v) 41.58618 42.14718  42.89853  42.68584  43.39182  44.46577    10

Armadillo and BLAS (OpenBLAS in my case) are almost the same. And the BLAS call is what R does in the end as well. So roughly 2/3 of R's time here goes to error checking etc.
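
As a rough cross-check (on R >= 3.4.0, which introduced the matprod option), you can tell R to skip the NaN/Inf scan and call BLAS unconditionally; with that setting %*% should time much like blas_mm above. A minimal sketch:

old <- options(matprod = "blas")  # always dispatch straight to BLAS, no non-finite scan
system.time(m %*% v)
options(old)                      # restore the default behaviour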

Ralf Stubner
  • And probably OpenMP to boot (provided your OS / compiler support it). – Dirk Eddelbuettel Jun 27 '18 at 22:01
  • @Dirk I would have expected that Armadillo forwards such simple things directly to BLAS (which is also multi-threaded in my case). At least they are equally fast ... – Ralf Stubner Jun 28 '18 at 09:19
  • Very interesting. It would make sense that the checking costs don't scale as quickly as the computation for matrix-matrix, so this cost disappears in that case. – Cliff AB Jun 28 '18 at 14:21
  • @CliffAB Yes. In addition, for matrix-matrix it will be more difficult to out-smart the memory access in your BLAS implementation with a naive approach, c.f. the link provided above by F.Prive. – Ralf Stubner Jun 28 '18 at 14:27
  • @RalfStubner R's matrix multiplications behave very strangely with wide/thin matrices [depending on the order](https://stackoverflow.com/questions/74061447/faster-evaluation-of-matrix-multiplication-from-right-to-left). Do you have any idea what causes this? – dcsuka Oct 22 '22 at 03:17

To add another point to Ralf Stubner's solution, you can use the following C++ version to

  1. do multiple columns at the same time to avoid re-reading the output vector many times.
  2. add __restrict__ to potentially allow for vectorized operations (this likely does not matter here, I guess, as it is only reads).

#include <Rcpp.h>
using namespace Rcpp;

inline void mat_vec_mult_vanilla
(double const * __restrict__ m, 
 double const * __restrict__ v, 
 double * __restrict__ const res, 
 size_t const dn, size_t const dm) noexcept {
  for(size_t j = 0; j < dm; ++j, ++v){
    double * r = res;
    for(size_t i = 0; i < dn; ++i, ++r, ++m)
      *r += *m * *v;
  }
}

inline void mat_vec_mult
(double const * __restrict__ const m, 
 double const * __restrict__ const v, 
 double * __restrict__ const res, 
 size_t const dn, size_t const dm) noexcept {
  size_t j(0L);
  double const * vj = v,
               * mi = m;
  constexpr size_t const ncl(8L);
  {
    double const * mvals[ncl];
    size_t const end_j = dm - (dm % ncl),
                   inc = ncl * dn;
    for(; j < end_j; j += ncl, vj += ncl, mi += inc){
      double *r = res;
      mvals[0] = mi;
      for(size_t i = 1; i < ncl; ++i)
        mvals[i] = mvals[i - 1L] + dn;
      for(size_t i = 0; i < dn; ++i, ++r)
        for(size_t ii = 0; ii < ncl; ++ii)
          *r += *(vj + ii) * *mvals[ii]++;
    }
  }
  
  mat_vec_mult_vanilla(mi, vj, res, dn, dm - j);
}

// [[Rcpp::export("mat_vec_mult", rng = false)]]
NumericVector mat_vec_mult_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(), 
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult(&m[0], &v[0], &res[0], dn, dm);
  return res;
}

// [[Rcpp::export("mat_vec_mult_vanilla", rng = false)]]
NumericVector mat_vec_mult_vanilla_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(), 
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult_vanilla(&m[0], &v[0], &res[0], dn, dm);
  return res;
}

The result with -O3 in my Makevars file and gcc-8.3 is

set.seed(1)
dn <- 10001L
dm <- 10001L
m <- matrix(rnorm(dn * dm), dn, dm)
lv <- rnorm(dm)

all.equal(drop(m %*% lv), mat_vec_mult(m = m, v = lv))
#R> [1] TRUE
all.equal(drop(m %*% lv), mat_vec_mult_vanilla(m = m, v = lv))
#R> [1] TRUE

bench::mark(
  R              = m %*% lv, 
  `OP's version` = my_mm(m = m, v = lv), 
  `BLAS`         = blas_mm(m = m, v = lv),
  `C++ vanilla`  = mat_vec_mult_vanilla(m = m, v = lv), 
  `C++`          = mat_vec_mult(m = m, v = lv), check = FALSE)
#R> # A tibble: 5 x 13
#R>   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                 time          gc               
#R>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                 <list>        <list>           
#R> 1 R             147.9ms    151ms      6.57    78.2KB        0     4     0      609ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [4]>  <tibble [4 × 3]> 
#R> 2 OP's version   56.9ms   57.1ms     17.4     78.2KB        0     9     0      516ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [9]>  <tibble [9 × 3]> 
#R> 3 BLAS           90.1ms   90.7ms     11.0     78.2KB        0     6     0      545ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [6]>  <tibble [6 × 3]> 
#R> 4 C++ vanilla    57.2ms   57.4ms     17.4     78.2KB        0     9     0      518ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [9]>  <tibble [9 × 3]> 
#R> 5 C++              51ms   51.4ms     19.3     78.2KB        0    10     0      519ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [10]> <tibble [10 × 3]>

So a slight improvement. The result may, however, depend heavily on the BLAS version. The version I used is

sessionInfo()
#R> #...
#R> Matrix products: default
#R> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#R> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#R> ...

The whole file I Rcpp::sourceCpp()ed is

#include <Rcpp.h>
#include <R_ext/BLAS.h>
using namespace Rcpp;

inline void mat_vec_mult_vanilla
(double const * __restrict__ m, 
 double const * __restrict__ v, 
 double * __restrict__ const res, 
 size_t const dn, size_t const dm) noexcept {
  for(size_t j = 0; j < dm; ++j, ++v){
    double * r = res;
    for(size_t i = 0; i < dn; ++i, ++r, ++m)
      *r += *m * *v;
  }
}

inline void mat_vec_mult
(double const * __restrict__ const m, 
 double const * __restrict__ const v, 
 double * __restrict__ const res, 
 size_t const dn, size_t const dm) noexcept {
  size_t j(0L);
  double const * vj = v,
               * mi = m;
  constexpr size_t const ncl(8L);
  {
    double const * mvals[ncl];
    size_t const end_j = dm - (dm % ncl),
                   inc = ncl * dn;
    for(; j < end_j; j += ncl, vj += ncl, mi += inc){
      double *r = res;
      mvals[0] = mi;
      for(size_t i = 1; i < ncl; ++i)
        mvals[i] = mvals[i - 1L] + dn;
      for(size_t i = 0; i < dn; ++i, ++r)
        for(size_t ii = 0; ii < ncl; ++ii)
          *r += *(vj + ii) * *mvals[ii]++;
    }
  }
  
  mat_vec_mult_vanilla(mi, vj, res, dn, dm - j);
}

// [[Rcpp::export("mat_vec_mult", rng = false)]]
NumericVector mat_vec_mult_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(), 
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult(&m[0], &v[0], &res[0], dn, dm);
  return res;
}

// [[Rcpp::export("mat_vec_mult_vanilla", rng = false)]]
NumericVector mat_vec_mult_vanilla_cpp(NumericMatrix m, NumericVector v){
  size_t const dn = m.nrow(), 
               dm = m.ncol();
  NumericVector res(dn);
  mat_vec_mult_vanilla(&m[0], &v[0], &res[0], dn, dm);
  return res;
}

// [[Rcpp::export(rng = false)]]
NumericVector my_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  double v_j;
  for(int j = 0; j < nCol; j++){
    v_j = v[j];
    for(int i = 0; i < nRow; i++){
      ans[i] += m(i,j) * v_j;
    }
  }
  return(ans);
}

// [[Rcpp::export(rng = false)]]
NumericVector blas_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  char trans = 'N';
  double one = 1.0, zero = 0.0;
  int ione = 1;
  F77_CALL(dgemv)(&trans, &nRow, &nCol, &one, m.begin(), &nRow, v.begin(),
           &ione, &zero, ans.begin(), &ione);
  return ans;
}

/*** R
set.seed(1)
dn <- 10001L
dm <- 10001L
m <- matrix(rnorm(dn * dm), dn, dm)
lv <- rnorm(dm)

all.equal(drop(m %*% lv), mat_vec_mult(m = m, v = lv))
all.equal(drop(m %*% lv), mat_vec_mult_vanilla(m = m, v = lv))

bench::mark(
  R              = m %*% lv, 
  `OP's version` = my_mm(m = m, v = lv), 
  `BLAS`         = blas_mm(m = m, v = lv),
  `C++ vanilla`  = mat_vec_mult_vanilla(m = m, v = lv), 
  `C++`          = mat_vec_mult(m = m, v = lv), check = FALSE)
*/
  • Interesting: in your results, BLAS is considerably slower than the straightforward C++ version (yours or mine). @RalfStubner's results have BLAS approximately twice as fast as mine. Could Ralf's BLAS be using 2 (or more) threads? Or different versions? – Cliff AB Aug 31 '20 at 17:21
  • RalfStubner states he is using OpenBLAS. I am using the default BLAS, so I figure that this is the cause of the difference. I suspect that it is just the implementation, but it may be that he is using more threads. – Benjamin Christoffersen Sep 01 '20 at 05:07