
I am experiencing substantially slower matrix multiplication in R compared to Python for large matrices. For example (in Python):

import numpy as np

A = np.random.rand(4112, 23050).astype('float32')
B = np.random.rand(23050, 2500).astype('float32')

%timeit np.dot(A, B)

1 loops, best of 3: 1.09 s per loop

Here is the equivalent multiplication in R (takes almost 10x longer):

A <- matrix(rnorm(4112*23050), ncol = 23050)
B <- matrix(rnorm(23050*2500), ncol = 2500)

system.time(A %*% B)

user    system  elapsed 
72.032   1.048   9.444 

How can I achieve matrix multiplication speeds in R that are comparable to what is standard with python?

What I Have Already Tried:

1) Part of the discrepancy seems to be that Python supports float32 whereas R only has numeric, which is double precision (float64). For example, the same Python commands as above, except with float64, take twice as long (though still about 4x faster than R's elapsed time):

import numpy as np

A = np.random.rand(4112, 23050).astype('float64')
B = np.random.rand(23050, 2500).astype('float64')

%timeit np.dot(A, B)
1 loops, best of 3: 2.24 s per loop
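As a quick sanity check on the dtype effect, here is a small, self-contained timing sketch (sizes are deliberately small so it runs quickly; absolute times will differ from the large matrices above):

```python
import time
import numpy as np

def time_matmul(dtype, n=512, reps=5):
    # Build two n x n operands at the requested precision and return
    # the best-of-reps wall-clock time plus the result dtype.
    rng = np.random.default_rng(0)
    A = rng.random((n, n)).astype(dtype)
    B = rng.random((n, n)).astype(dtype)
    best = float("inf")
    C = None
    for _ in range(reps):
        t0 = time.perf_counter()
        C = A @ B
        best = min(best, time.perf_counter() - t0)
    return best, C.dtype

t32, d32 = time_matmul("float32")
t64, d64 = time_matmul("float64")
print(f"float32: {t32:.4f}s  float64: {t64:.4f}s")
```

On most BLAS builds the float32 multiply runs roughly twice as fast, consistent with the 1.09 s vs 2.24 s gap above; R has no single-precision matrix type, so `%*%` always pays the float64 cost.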

2) I am using the openBLAS linear algebra back-end for R.
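One thing worth double-checking is that NumPy itself is actually linked against an optimized BLAS, otherwise the comparison baseline is off. A minimal check on the Python side (the report's layout varies across NumPy versions):

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration NumPy was compiled against,
# e.g. openblas or mkl; an unoptimized reference BLAS here would make
# the Python timings much worse than those quoted above.
np.show_config()
```

On the R side, recent R versions (3.4+) report the BLAS and LAPACK shared libraries in use at the end of `sessionInfo()` output.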

3) RcppEigen, as detailed in the answer to this SO question (see link for the test.cpp file). The multiplication is about twice as fast in "user" time, but 3x slower in the more critical elapsed time, as it only uses 1 of my 8 threads.

library(Rcpp)
sourceCpp("test.cpp")

A <- matrix(rnorm(4112*23050), nrow = 4112)
B <- matrix(rnorm(23050*2500), ncol = 2500)

system.time(res <- eigenMatMult(A, B))
user    system  elapsed 
29.436   0.056  29.551 
alexvpickering
  • What OS? http://simplystatistics.org/2016/01/21/parallel-blas-in-r/ ; https://cran.r-project.org/web/packages/gcbd/vignettes/gcbd.pdf . http://stackoverflow.com/questions/10025866/parallel-linear-algebra-for-multicore-system claims OpenBLAS is multithreaded, so ... ?? – Ben Bolker Jul 28 '16 at 17:01
  • PS I was a little annoyed that you deleted and reposted, but it seems [that's actually OK](http://meta.stackoverflow.com/questions/253438/is-deleting-a-question-and-posting-a-new-one-with-issues-fixed-acceptable) since you improved your question. – Ben Bolker Jul 28 '16 at 17:02
  • Ubuntu. `%*%` and Armadillo matrix multiplication (which provided no benefit over `%*%`) use all my cores but Eigen does not. – alexvpickering Jul 28 '16 at 17:08
  • Maybe see this on Eigen: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html – Dirk Eddelbuettel Jul 28 '16 at 17:11
  • @Dirk Eddelbuettel Thanks for the suggestion: I just tried adding `Eigen::setNbThreads(8);` in test.cpp on line before `Eigen::MatrixXd C = A * B;` but it still used a single thread. – alexvpickering Jul 28 '16 at 17:22
  • by the way, I'm assuming your actual use-case involves dense matrices -- if not then you should definitely be using sparse methods (e.g. `Matrix` package) – Ben Bolker Jul 28 '16 at 17:22
  • were compiler options (`-fopenmp`) set correctly ... ? – Ben Bolker Jul 28 '16 at 17:23
  • @Ben Bolker Yes use case involves dense matrices with similar dimensions to examples above. – alexvpickering Jul 28 '16 at 17:24
  • @Ben Bolker How do I check that? – alexvpickering Jul 28 '16 at 17:28

2 Answers


I use MRO and Python with Anaconda and the MKL BLAS. Here are my results for the same data-generating process, i.e. np.random.rand('float64') or rnorm, with identical dimensions (average and standard deviation over 10 replications):

Python:

np.dot(A, B) # 1.3616 s (sd = 0.1776)

R:

Bt = t(B)
a = A %*% B # 2.0285 s (sd = 0.1897)
acp = tcrossprod(A, Bt) # 1.3098 s (sd = 0.1206)
identical(acp, a) # TRUE
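To see why the last two lines agree: `tcrossprod(A, Bt)` computes `A %*% t(Bt)`, and since `Bt = t(B)`, that is exactly `A %*% B`. The same identity checked with small matrices in NumPy terms (a sketch, not part of the benchmark above):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((40, 230))
B = rng.random((230, 25))
Bt = B.T  # analogue of Bt = t(B) in the R snippet above

# A @ Bt.T mirrors tcrossprod(A, Bt) = A %*% t(Bt), which is A %*% B
direct = A @ B
via_tcrossprod = A @ Bt.T
print(np.allclose(direct, via_tcrossprod))  # True
```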
sbstn
  • so just to clarify/put your results in words, (1) the MRO/MKL BLAS brings the timing within a factor of 2, (2) using `tcrossprod` gains another factor of 2, making the Python & R timings approx equal? It would be worth knowing the platform (OS) on which you did this – Ben Bolker Jul 28 '16 at 19:35
  • Yes, that's it. The platform is Ubuntu 14.04. I think the setup, i.e. `anaconda` (MRO/python 2.7) + `MKL`, makes sure that the comparison is fair... – sbstn Jul 28 '16 at 19:46
  • And just to state the obvious: MKL can be obtained independently of either of those two products, and still be provided for both languages. – Dirk Eddelbuettel Jul 28 '16 at 19:57
  • Just a side note that MKL does not support hyperthreading ([reference](https://groups.google.com/forum/#!msg/rropen/h_q_HUsQM2s/_DXisD8hAQAJ)). It will only use 4 of my 8 hyperthreads because my computer only has 4 physical cores. – alexvpickering Jul 28 '16 at 20:29

Slightly tangential, but too long for a comment I think. To check whether the relevant compiler flags (e.g. `-fopenmp`) are set, use `sourceCpp("testeigen.cpp", verbose=TRUE)`.

On my system, this showed that the OpenMP flags are not defined by default.

I did this to enable them (adapted from here):

library(Rcpp)
pkglibs <- "-fopenmp -lgomp"
pkgcxxflags <- "-fopenmp"
Sys.setenv(PKG_LIBS=pkglibs,PKG_CXXFLAGS=pkgcxxflags)
sourceCpp("testeigen.cpp",verbose=TRUE)
  • Dirk Eddelbuettel comments that he prefers to set the compiler flags in ~/.R/Makevars.
  • The example I took this from called the internal Rcpp:::RcppLdFlags and Rcpp:::RcppCxxFlags functions and prepended the results to the flags given above; this seems not to be necessary (?)
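On the thread-count question that runs through this thread: with OpenBLAS or MKL, the number of BLAS threads can usually be capped or raised via environment variables set before the library is first loaded. A sketch from the Python side (which variable takes effect depends on how your BLAS was built; `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`, and the generic `OMP_NUM_THREADS` are the usual knobs, and the same variables apply to an R process started from that shell):

```python
import os

# Must be set before NumPy (and hence the BLAS) is first imported.
os.environ["OPENBLAS_NUM_THREADS"] = "4"  # OpenBLAS-specific
os.environ["OMP_NUM_THREADS"] = "4"       # generic OpenMP knob

import numpy as np

A = np.ones((64, 64))
B = np.ones((64, 64))
C = A @ B
print(C[0, 0])  # each entry is a 64-term dot product of ones -> 64.0
```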
Ben Bolker
  • We have a so-called plugin for OpenMP, as we do for C++11, C++14, ... and related switches. See eg [here](https://github.com/RcppCore/Rcpp/blob/master/R/Attributes.R#L484-L488) and IIRC the Rcpp Gallery pieces on OpenMP all show it too. – Dirk Eddelbuettel Jul 28 '16 at 20:12
  • Thanks Ben, this did allow for parallel matrix multiplication with RcppEigen. Surprisingly, the benefit was minor (less than 1s faster). – alexvpickering Jul 28 '16 at 20:39