
On my two computers, I tried this code:

N <- 10e3
M <- 2000
X <- matrix(rnorm(N * M), N)
system.time(crossprod(X))

The first one is a standard laptop and this operation takes 1.7 sec.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

The second one is a rather good desktop computer and it took 17 seconds.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.3

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

The desktop computer is more powerful than the laptop, yet it takes 10 times longer for this matrix computation.

Is the problem coming from the default BLAS/LAPACK used?
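One way to double-check which BLAS library a running R session has actually loaded (a rough check, assuming a Linux system with `lsof` installed):

# List the BLAS/LAPACK shared libraries mapped into this R process
system(paste("lsof -p", Sys.getpid(), "| grep -i -e blas -e lapack"))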

F. Privé
  • You have one hell of a standard laptop there. Takes 31 secs on mine – David Arenburg Jun 14 '18 at 12:40
  • 8 GB, 2 cores, just a standard computer here. This is why I ask about the math library used. – F. Privé Jun 14 '18 at 13:54
  • It could be. In [this book](https://csgillespie.github.io/efficientR/set-up.html#blas-and-alternative-r-interpreters) they suggest that they got a huge gain in performance by installing OpenBLAS. I also don't think RAM/cores impact performance much (as long as you can fit into memory), as R isn't multi-threading by default IIRC. Probably your CPU combined with BLAS and R version would be the key factors, though I may be wrong. – David Arenburg Jun 14 '18 at 14:11
  • On my PC (Windows >= 8 x64, build 9200) it takes 0.25s using R Open (3.5) and MKL, while it takes 19.63s on regular R (3.4.2) – missuse Jun 14 '18 at 16:22

2 Answers


tl;dr: CentOS uses a single-threaded OpenBLAS; Linux Mint uses the reference BLAS by default but can easily be switched to other BLAS implementations.

The R packages for CentOS available from EPEL depend on `openblas-Rblas`, which seems to be an OpenBLAS build that provides BLAS for R. So although `sessionInfo()` makes it look like R's own BLAS is used, it is actually OpenBLAS. The LAPACK used is always the one shipped with R.
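You can confirm this on CentOS by asking which package owns R's BLAS library (a quick check, assuming an EPEL-based install with `rpm` available):

# Ask RPM which package provides R's BLAS shared library;
# on EPEL installs this reports openblas-Rblas rather than R itself.
system("rpm -qf /usr/lib64/R/lib/libRblas.so")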

On Debian and derived distributions like Mint, r-base-core depends on

  • libblas3 | libblas.so.3
  • liblapack3 | liblapack.so.3

By default these are provided by the reference implementations libblas3 and liblapack3. These are not particularly fast, but you can replace them easily by installing packages like `libopenblas-base`. You control which BLAS and LAPACK are used on your system via `update-alternatives`, as shown below.
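After switching, e.g. with `sudo update-alternatives --config libblas.so.3` (the name of the alternative can differ between releases), you can verify from a fresh R session which libraries are picked up, assuming R >= 3.4 where `sessionInfo()` reports the BLAS/LAPACK paths:

# Paths of the BLAS/LAPACK libraries this session was started with
si <- sessionInfo()
si$BLAS       # e.g. an OpenBLAS path after the switch
si$LAPACK
La_version()  # LAPACK version string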

For controlling the number of threads with OpenBLAS I normally use RhpcBLASctl:

N <- 20000
M <- 2000
X <- matrix(rnorm(N * M), N)
RhpcBLASctl::blas_set_num_threads(2)
system.time(crossprod(X))
#>    user  system elapsed 
#>   2.492   0.331   1.339
RhpcBLASctl::blas_set_num_threads(1)
system.time(crossprod(X))
#>    user  system elapsed 
#>   2.319   0.052   2.316

For some reason, setting the environment variables `OPENBLAS_NUM_THREADS`, `GOTO_NUM_THREADS` or `OMP_NUM_THREADS` from within R does not have the desired effect. On CentOS even RhpcBLASctl does not help, since the OpenBLAS used there is a single-threaded build.
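A plausible explanation is that OpenBLAS reads these variables only once, when the shared library is loaded, so by the time `Sys.setenv()` can be called the thread count is already fixed. Setting the variable before R starts should work on a multi-threaded OpenBLAS build:

# Setting the variable from within the session is usually too late,
# since OpenBLAS already read the environment when it was loaded:
Sys.setenv(OPENBLAS_NUM_THREADS = 1)  # typically no effect anymore
# Instead, export it before launching R, e.g. from a shell:
#   OPENBLAS_NUM_THREADS=1 R --vanilla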

Ralf Stubner
  • I installed `libopenblas-base`, yet I don't know how to tell it to use only one core for matrix computations. Do you know how to do it? – F. Privé Jul 03 '18 at 12:28
  • Unfortunately, using `RhpcBLASctl::blas_set_num_threads(2)` has no effect on my computer. – F. Privé Jul 03 '18 at 15:03
  • @F.Privé That is odd. I see it not only in the run-times but also in `htop`. Do the environment variables work for you? – Ralf Stubner Jul 03 '18 at 15:18
  • Nope, I tried `GOTO_NUM_THREADS` before and it had no effect. – F. Privé Jul 03 '18 at 18:22
  • @F.Prive Strange, would that be the CentOS, the Mint system or both? – Ralf Stubner Jul 03 '18 at 22:09
  • Sorry, the CentOS one. I won't have access to the Linux Mint one until next week (but I recall that `GOTO_NUM_THREADS` was useless too). – F. Privé Jul 04 '18 at 04:47
  • @F.Privé Wait a sec. Are you asking how to use **more** than one thread on CentOS? I fear that is not possible with the default packages. At least I interpret https://src.fedoraproject.org/cgit/rpms/openblas.git/tree/openblas.spec#n401 such that a serial build of OpenBLAS is used for RBlas. – Ralf Stubner Jul 04 '18 at 05:02
  • You're probably right. I'll test on Linux Mint later. Thanks. – F. Privé Jul 04 '18 at 05:13
  • Nice sleuthing: will keep in mind if/when using `R` on Debian/ubuntu and need the performance. – WestCoastProjects Sep 16 '19 at 13:02

R is distributed with a default BLAS implementation, but it may not be optimized for your computer. Installing an optimized BLAS such as ATLAS or OpenBLAS before installing R, as advised in the R installation guide, is the way to go. If you click on "Download R for Linux" and then on "debian/", it says:

You may want to install the automatically tuned Atlas or the multi-threaded OpenBlas library in order to get higher performance for linear algebra operations

The source of R can be downloaded here, and the BLAS implementation is located in `R-3.5.0/src/extra/blas`. For instance, the Fortran source code of the matrix-matrix multiplication routine `dgemm` is located in `blas.f`, along with most other BLAS routines (in a single file!).

The comments of the function specify:

 -- Written on 8-February-1989.
 Jack Dongarra, Argonne National Laboratory.
 Iain Duff, AERE Harwell.
 Jeremy Du Croz, Numerical Algorithms Group Ltd.
 Sven Hammarling, Numerical Algorithms Group Ltd.

The very same lines can be found in the Netlib implementation of the routine `dgemm`.

In contrast, OpenBLAS provides different implementations, one per processor microarchitecture. See for instance this file devoted to `dgemm` for the Haswell microarchitecture. There are calls to `prefetcht0` for prefetching and calls to `vfmadd231pd`, a SIMD FMA instruction which performs the double-precision operation d = a*b + c four times at once.
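To see whether your CPU exposes the FMA/AVX2 instructions such kernels rely on, you can inspect the CPU flags (a rough check, Linux only):

# Extract the relevant SIMD feature flags from /proc/cpuinfo
system("grep -m1 -o -e avx2 -e fma /proc/cpuinfo | sort -u")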

Using an optimized BLAS can save the day. See for instance this benchmark, where Netlib's `dgemm()` takes 64 seconds while the MKL, OpenBLAS or ATLAS versions need less than 4 seconds.
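If you want to reproduce this kind of comparison on your own machine, a minimal sketch is to time a plain matrix product once before and once after switching BLAS:

# Plain dgemm benchmark: multiply two 2000 x 2000 matrices.
# The reference BLAS can take tens of seconds here, while an
# optimized BLAS usually finishes in well under a second.
n <- 2000
A <- matrix(rnorm(n * n), n)
B <- matrix(rnorm(n * n), n)
system.time(A %*% B)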

R's internal BLAS might be even slower than the classical Netlib library. Indeed, as specified in appendix A.3.1.5 'Shared BLAS' of R Installation and Administration:

R offers the option of compiling the BLAS into a dynamic library libRblas stored in R_HOME/lib and linking both R itself and all the add-on packages against that library. .... There may be performance disadvantages in using a shared BLAS. ... However, experiments showed that in many cases using a shared BLAS was as fast, provided high levels of compiler optimization are used.

Looking at R's `config.site` file, the optimization level is set to '-O2' for g77/gfortran. Therefore, tuning the `FFLAGS` option could prove useful if the Fortran compiler is not g77/gfortran. During the configure step, there should be a line like `checking whether we are using the GNU Fortran 77 compiler... yes` (line 7521 of the configure file).

francis
  • Thanks. I am aware of more optimized math libraries. What I don't understand here is why these two defaults provide that much difference in terms of performance. – F. Privé Jun 14 '18 at 18:23
  • The configure of R makes use of -O2 optimization if and only if g77/gfortran is detected properly. It might also be the case for ifc/ifort on x86_64 platforms. Let me know if you wish me to delete my answer! – francis Jun 14 '18 at 21:07
  • No, your answer is good (I've upvoted it). It's just not exactly the answer I'm searching for. I'll try to check whether g77/gfortran is installed. Thanks – F. Privé Jun 15 '18 at 05:59
  • I have gfortran installed, but I can't find any of the other 3 libraries you mentioned. – F. Privé Jun 15 '18 at 09:32