I observed that matrix by vector multiplication with a large matrix is significantly slower in R (version 3.6.1) than in Matlab (version 2019b), while both languages rely on the (same?) BLAS library. See below a minimal example:
In Matlab:
n=900;
p=900;
A=reshape(1:(n*p),[n,p]);
x=ones(p,1);
tic()
for id = 1:1000
x = A*x;
end
toc()
In R:
n=900
p=900
A=matrix(c(1:(n*p)),nrow=n,ncol=p)
x=rep(1,ncol(A))
t0 <- Sys.time()
for(iter in 1:1000){
x = A%*%x
}
t1 <- Sys.time()
print(t1-t0)
I get a running execution time of roughly 0.05sec in Matlab versus 3.5sec in R using the same computer. Any idea of the reason for such difference?
Thanks.
[EDIT]: I add below a similar calculus in C (using the CBLAS library, compilation using gcc cblas_dgemv.c -lblas -o cblas_dgemv
, where cblas_dgemv.c denotes the source file below). I get a running time of roughly 0.08s which is quite close to the running times obtained using Matlab (0.05s). I am still trying to figure out the reason of this huge running time in R (3.5s).
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include "cblas.h"
int main(int argc, char **argv)
{
int m=900,adr;
double *A,*x,*y;
struct timeval t0,t1;
/* memory allocation and initialization */
A = (double*)malloc(m*m*sizeof(double));
x = (double*)malloc(m*sizeof(double));
y = (double*)malloc(m*sizeof(double));
for(adr=0;adr<m;adr++) x[adr] = 1.;
for(adr=0;adr<m*m;adr++) A[adr] = adr;
/* main loop */
gettimeofday(&t0, NULL);
for(adr=0;adr<1000;adr++)
cblas_dgemv(CblasColMajor,CblasNoTrans,m,m,1.,A,m,x,1,0.,y,1);
gettimeofday(&t1, NULL);
printf("elapsed time = %.2e seconds\n",(double)(t1.tv_usec-t0.tv_usec)/1000000. + (double)(t1.tv_sec-t0.tv_sec));
/* free memory */
free(A);
free(x);
free(y);
return EXIT_SUCCESS;
}
Notice that I was not able to set y=x in the cblas_dgemv routine. Thus, this C calculus is slightly different from that done in R and Matlab codes above. However the compilation was done without optimization flag (no option -O3) and I checked that the matrix-vector product was indeed called at each iteration of the loop (performing 10x more iterations leads to a 10x longer running time).