
I have observed that matrix-by-vector multiplication with a large matrix is significantly slower in R (version 3.6.1) than in Matlab (version 2019b), even though both languages rely on the (same?) BLAS library. Below is a minimal example:

In Matlab:

n=900; 
p=900; 
A=reshape(1:(n*p),[n,p]); 
x=ones(p,1); 
tic()
for id = 1:1000
  x = A*x; 
end
toc()

In R:

n=900
p=900
A=matrix(c(1:(n*p)),nrow=n,ncol=p)
x=rep(1,ncol(A))
t0 <- Sys.time()
for(iter in 1:1000){
  x = A%*%x
}
t1 <- Sys.time()
print(t1-t0)

I get an execution time of roughly 0.05 s in Matlab versus 3.5 s in R on the same computer. Any idea of the reason for such a difference?

Thanks.

[EDIT]: I add below a similar calculation in C (using the CBLAS library, compiled with gcc cblas_dgemv.c -lblas -o cblas_dgemv, where cblas_dgemv.c denotes the source file below). I get a running time of roughly 0.08 s, which is quite close to the running time obtained with Matlab (0.05 s). I am still trying to figure out the reason for the huge running time in R (3.5 s).

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include "cblas.h"

int main(int argc, char **argv)
{
  int m=900,adr;
  double *A,*x,*y;
  struct timeval t0,t1;

  /* memory allocation and initialization */
  A = (double*)malloc(m*m*sizeof(double)); 
  x = (double*)malloc(m*sizeof(double));  
  y = (double*)malloc(m*sizeof(double));  
  for(adr=0;adr<m;adr++) x[adr] = 1.; 
  for(adr=0;adr<m*m;adr++) A[adr] = adr;

  /* main loop */
  gettimeofday(&t0, NULL);
  for(adr=0;adr<1000;adr++)
    cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A,m,x,1,0.,y,1);
  gettimeofday(&t1, NULL);
  printf("elapsed time = %.2e seconds\n",(double)(t1.tv_usec-t0.tv_usec)/1000000. + (double)(t1.tv_sec-t0.tv_sec));

  /* free memory */
  free(A);
  free(x);
  free(y); 

  return EXIT_SUCCESS;
}

Notice that I was not able to set y = x in the cblas_dgemv routine, since dgemv does not support in-place operation (the input vector x is passed as const). Thus, this C calculation is slightly different from the one done in the R and Matlab codes above. However, the compilation was done without an optimization flag (no -O3 option), and I checked that the matrix-vector product was indeed computed at each iteration of the loop (performing 10x more iterations leads to a 10x longer running time). An equivalent fixed-x variant in R is sketched below.
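
For reference, here is a minimal R sketch of the corresponding like-for-like benchmark: it computes y = A %*% x with x held fixed, as in the C code above, instead of updating x in place:

n=900
A=matrix(c(1:(n*n)),nrow=n,ncol=n)
x=rep(1,n)
t0 <- Sys.time()
for(iter in 1:1000){
  y = A%*%x   # x is held fixed, matching the C benchmark
}
t1 <- Sys.time()
print(t1-t0)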

RémyA
  • Matlab is highly optimized. Once I was working with GNU Octave; same code but Matlab was wayyyyy faster. – paul-shuvo Feb 14 '20 at 21:00
  • As far as I know, linear algebra operations rely on BLAS/LAPACK libraries in both languages – RémyA Feb 14 '20 at 21:06
  • If you run `sessionInfo()` in R, what does it show for "Matrix products"? – MrFlick Feb 14 '20 at 21:27
  • it returns BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0 LAPACK: /usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3 – RémyA Feb 14 '20 at 21:32
  • Possibly related: https://stackoverflow.com/questions/50857568/large-performance-differences-between-os-for-matrix-computation – Andrew Janke Feb 14 '20 at 22:04
  • Here's something interesting: https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html. Look at the "default" behavior for the "matprod" option: if the inputs "may contain NaN or Inf values", it falls back to an unoptimized 3-loop implementation instead of using BLAS. Maybe that case is being hit here for some reason? But I couldn't get it to speed up much by doing `options(matprod = "blas")`. – Andrew Janke Feb 15 '20 at 00:43
  • Here's where it does the "maybe nan or inf" check: https://github.com/wch/r-source/blob/e206afa85277c7fdcad97db9af79e97e4efda656/src/main/array.c#L1007-L1010 and here's the test itself: https://github.com/wch/r-source/blob/e206afa85277c7fdcad97db9af79e97e4efda656/src/main/array.c#L618-L637. Do you suppose it's getting a false positive there? Or maybe the "may have nan or inf" test itself is taking a long time? – Andrew Janke Feb 15 '20 at 01:20
  • @Andrew: thanks for this suggestion, this is difficult to know but I will try to investigate. At least, I think that my EDIT above shows that the computation time of the matrix-vector product using BLAS is similar to that obtained using Matlab (MKL BLAS). – RémyA Feb 15 '20 at 16:22
  • @AndrewJanke: I do not think we are facing a false positive of the `mayHaveNaNOrInf` test, since in that case the matrix-vector product would be done with the internal unoptimized 3-loop algorithm, which would be even slower. Indeed, by setting `options(matprod = "internal")` I get a running time above 1 minute! Besides, according to (https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html), the setting `options(matprod = "blas")` should bypass this `mayHaveNaNOrInf` test (see the sketch after these comments). This is quite frustrating! – RémyA Feb 15 '20 at 19:45
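
For anyone who wants to reproduce the checks discussed in the last two comments, here is a minimal R sketch (assuming R >= 3.4, where the matprod option is available) that times the same product under the three matprod settings:

n=900
A=matrix(c(1:(n*n)),nrow=n,ncol=n)
x=rep(1,n)
for(mp in c("default","blas","internal")){
  options(matprod=mp)
  print(mp)
  # time 100 products under the current setting ("internal" is very slow)
  print(system.time(for(iter in 1:100) y <- A%*%x))
}
options(matprod="default")  # restore the default behavior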

2 Answers


Here's something kind of shocking:

The precompiled R distribution that is downloaded from CRAN makes use of the reference BLAS/LAPACK implementation for linear algebra operations

The "reference BLAS" is the non-optimized, non-accelerated BLAS, unlike OpenBLAS or Intel MKL. Matlab uses MKL, which is accelerated.

This seems to be confirmed by sessionInfo() in my R 3.6.0 on macOS:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

If I'm reading this right, that means that by default, R uses a slow BLAS, and if you want it to go fast, you need to do some configuration to get it to use a fast BLAS instead.

This is a little surprising to me. As I understand it, the reference BLAS is generally used for testing and development, not for "actual work".

I get about the same timings as you in R 3.6 vs Matlab R2019b on macOS 10.14: 0.04 seconds in Matlab, 4.5 seconds in R. I think that's consistent with R using a non-accelerated BLAS.

Andrew Janke
  • interesting, I used "sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu" in order to use "/usr/lib/x86_64-linux-gnu/openblas/libblas.so.3" instead of "/usr/lib/x86_64-linux-gnu/blas/libblas.so.3"; sessionInfo() in R seems to confirm the change of BLAS library, but I do not observe a drastic improvement in computation time – RémyA Feb 14 '20 at 21:41
  • I'd be surprised if the reference BLAS is 100 times slower than MKL. On the other hand, loops in MATLAB are 100 times faster now than they were 20 years ago, so it's possible that what you're seeing is the JIT in action. What is the cost of that loop with a much smaller matrix? – Cris Luengo Feb 14 '20 at 21:51
  • The benchmark I linked to shows MKL being about 40x faster than the reference BLAS for matrix cross-product multiplication, which is what I think we're doing here. That sounds like it's in the ballpark. For me, doing a trivial matrix multiply gets it down to .001 s. – Andrew Janke Feb 14 '20 at 21:56
  • @RémyA: I'm also having trouble getting my R to use OpenBLAS using DYLD_INSERT_LIBRARIES on macOS. And, oddly, the Homebrew R formula looks like it's trying to build R against OpenBLAS, but may not be succeeding. I'll look in to this a bit more. – Andrew Janke Feb 14 '20 at 21:58
  • @Cris: I get running times of 0.001 sec in both Matlab and R languages for a trivial matrix-vector multiplication (size = 5x5). – RémyA Feb 14 '20 at 22:00
  • @Andrew: indeed, a speedup of 40x using MKL could explain our observations. I will investigate in this direction, thanks. – RémyA Feb 14 '20 at 22:03
  • Wow, that's a surprising difference between reference and optimized. I'll shut up now. :) – Cris Luengo Feb 14 '20 at 22:06
  • Well, now I'm confused: I switched to a Homebrewed R which is built against OpenBLAS, and sessionInfo looks like it's using it: `BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.7/lib/libopenblasp-r0.3.7.dylib` but it's still just as slow. Similar to what you're seeing. This needs further investigation. – Andrew Janke Feb 14 '20 at 22:18
  • Indeed, this is confusing... According to the above-mentioned benchmark, OpenBLAS should exhibit a similar speedup to MKL's with respect to the default BLAS... I am installing MKL now to check if I can get improvements. – RémyA Feb 14 '20 at 22:22
  • Well, now my sessionInfo() looks like `BLAS/LAPACK: /opt/intel/compilers_and_libraries_2018.2.199/linux/mkl/lib/intel64_lin/libmkl_rt.so` but I still get a running time in R of 2.5 or 3 seconds (far from the expected 40x speedup). – RémyA Feb 14 '20 at 22:26
  • Does R store matrices differently? Maybe a copy is necessary to bring it in a shape that can be used with BLAS? – Cris Luengo Feb 14 '20 at 22:28
  • @CrisLuengo: R is row-major, while Matlab and I think Fortran BLAS are column-major. So it needs to do a copy. You can see it here, I think: https://github.com/wch/r-source/blob/e206afa85277c7fdcad97db9af79e97e4efda656/src/main/array.c#L987-L990. That doesn't seem like it should account for 2+ seconds of run time, though. – Andrew Janke Feb 15 '20 at 00:57
  • Nope, I'm wrong: R is column major; that code is doing something else. (Filling in the other half of a half-filled symmetric matrix, maybe?) – Andrew Janke Feb 15 '20 at 01:04

I used Rcpp to interface a C++ implementation of the matrix-vector product with R.

Source C++ file ('cblas_dgemv2.cpp'): performs the product y = A*x 'niter' times

#include <cblas-openblas.h>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericVector cblas_dgemv2(Rcpp::NumericMatrix A, Rcpp::NumericVector x, int niter)
{
  int m=A.ncol(),iter;
  Rcpp::NumericVector y(m);
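  // Call BLAS dgemv niter times: y = 1.0*A*x + 0.0*y; A is assumed square
  // (m x m) and column-major (R's native layout), so no transposition is needed.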
  for(iter=0;iter<niter;iter++)
    cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A.begin(),m,x.begin(),1,0.,y.begin(),1);
  return y; 
}

Then I perform two experiments using the R code below:

  • Experiment 1: from R, call y=cblas_dgemv2(A,x,1000) once, performing the computation of the product y=A*x 1000 times in C++;
  • Experiment 2: from R, call y=cblas_dgemv2(A,x,1) 1000 times; each call performs the product y=A*x once in C++.

# compile cblas_dgemv2 (you may need to update the path to the library)
PKG_LIBS <- '/usr/lib/x86_64-linux-gnu/libopenblas.a' 
PKG_CPPFLAGS <- '-I/usr/include/x86_64-linux-gnu'
Sys.setenv(PKG_LIBS = PKG_LIBS , PKG_CPPFLAGS = PKG_CPPFLAGS) 
Rcpp::sourceCpp('cblas_dgemv2.cpp', rebuild = TRUE)

# create A and x 
n=900
A=matrix(c(1:(n*n)),nrow=n,ncol=n)
x=rep(1,n)

# Experiment 1: call 1 time cblas_dgemv2 (with 1000 iterations in C++)
t0 <- Sys.time()
y=cblas_dgemv2(A,x,1000) # perform the computation y = A%*%x 1000 times
t1 <- Sys.time()
print(t1-t0)

# Experiment 2: call 1000 times cblas_dgemv2  
t0 <- Sys.time()
for(iter in 1:1000){
  y=cblas_dgemv2(A,x,1) # perform the computation y = A%*%x once
}
t1 <- Sys.time()
print(t1-t0)

I get a running time of 0.08 s for the first experiment versus 4.8 s for the second experiment.

My conclusion: the bottleneck in terms of running time comes from the data transfers between R and C++, not from the computation of the matrix-vector product itself. Surprising, isn't it? (A sketch isolating this overhead is given below.)
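
One way to isolate this overhead (a sketch reusing the cblas_dgemv2 function above): call it with niter=0, so that the C++ loop body never executes and the measured time is pure R-to-C++ call and data-transfer cost.

# Experiment 3 (sketch): 1000 calls that do no BLAS work at all
t0 <- Sys.time()
for(iter in 1:1000){
  y=cblas_dgemv2(A,x,0) # niter=0: the dgemv loop never runs
}
t1 <- Sys.time()
print(t1-t0)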

RémyA
  • That is a crazy large overhead. Is R really copying all the data? This makes no sense! (Excellent experiment, by the way!) – Cris Luengo Feb 17 '20 at 14:43