Faster evaluation of matrix multiplication from right to left

Question

I noticed that evaluating matrix operations in quadratic form from right to left is significantly faster than left to right in R, depending on how the parentheses are placed. Obviously they both perform the same amount of calculation. I am wondering why this is the case. Does this have anything to do with memory allocation?

# A: 5000 * 5000
# B: 5000 * 2
A = matrix(runif(5000 * 5000), nrow = 5000)
B = matrix(rbinom(5000 * 2, size = 2, prob = 0.3), nrow = 5000)

microbenchmark((t(B) %*% A) %*% B, t(B) %*% (A %*% B), times = 100)

Here is the session info:

R version 4.2.0 (2022-04-22)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rcpp_1.0.9           microbenchmark_1.4.9

loaded via a namespace (and not attached):
 [1] compiler_4.2.0           fastmap_1.1.0            cli_3.3.0                htmltools_0.5.3          tools_4.2.0             
 [6] RcppArmadillo_0.11.2.4.0 rstudioapi_0.13          yaml_2.3.5               rmarkdown_2.14           knitr_1.39              
[11] xfun_0.31                digest_0.6.29            rlang_1.0.4              evaluate_0.15

EDIT: A simplified version of the matrix multiplication which displays the same error.

k <- 5000L; m <- n <- 2L;
A <- matrix(rnorm(k * k), k, k);
B <- matrix(rnorm(k * n), k, n);
tB <- t(B);
microbenchmark::microbenchmark(tB %*% A, A %*% B, times = 100)

I can't reproduce this (in fact the second is a bit slower on my system). Can you post `sessionInfo()` please (Im usinf openblas) — user20650, Oct 13 '22 at 21:11
@user20650 It’s funny - I can’t reproduce this in a cluster. Not sure what is happening here... — Taotao Tan, Oct 13 '22 at 21:54
I see the first is a bit faster than the second: 92 vs 60 ms per rep. Not the same as you saw. — user2554330, Oct 13 '22 at 22:19
I know this doesn't directly address your question, but you could also think about the relative timing of `crossprod()`/`tcrossprod()` ... — Ben Bolker, Oct 13 '22 at 23:07
I can fully reproduce this. Interestingly, @BenBolker, using `microbenchmark(crossprod(B, A) %*% B, t(B) %*% (A %*% B), times = 100)` cuts the first operation in half down to 67 ms, but it is still measurably slower than the second. The base R operation `%*%` is quite complex with `NA` and infinite checks, and there is significant potential for overhead especially with thin matrices or vectors. Perhaps a relevant post: https://stackoverflow.com/questions/51054227/why-is-this-naive-matrix-multiplication-faster-than-base-rs — dcsuka, Oct 14 '22 at 04:57
@TaotaoTan can you try the same with `bench::mark(..., iterations = 100)`. In my case the second approach is faster with `microbenchmark()` and slower with `bench::mark()`. Really odd. — TimTeaFan, Oct 19 '22 at 19:30
@TaotaoTan I now ran it 1000 times with `bench::mark()` and the median time is balanced now between both approaches. This does not happen with `microbenchmark()` so I suspect the issue rather lies in the timing function than in the actual R code. — TimTeaFan, Oct 19 '22 at 20:09
@TimTeaFan I still think the speed of the actual function is different. You may test this by running the loop 100 times and comparing which one is faster. — Taotao Tan, Oct 19 '22 at 20:48
I get exactly the same results with `microbenchmark()` and `bench::mark()`, still a huge difference. — dcsuka, Oct 19 '22 at 21:04
@dcsuka even with `1000` iterations? Really strange why we see so many differences here across different users and timing libraries. — TimTeaFan, Oct 20 '22 at 08:20
There are too many red herrings here - you should try to simplify. The following exhibits the "problem" on my system: `k <- 5000L; m <- n <- 2L; A <- matrix(rnorm(k * k), k, k); B <- matrix(rnorm(k * n), k, n); tB <- t(B); microbenchmark::microbenchmark(tB %*% A, A %*% B)` — Mikael Jagan, Oct 20 '22 at 15:43
@TimTeaFan Yes still with 1000. Mikael Jagan is right, the simpler form is more direct, will edit to include that as well. — dcsuka, Oct 20 '22 at 21:19
I don't know how helpful this is, but in this `A %*% B` = `C`, yet `tB %*% A` ≠ `t(C)`, so not exactly the same calculations are performed. All-ones matrices should be an exception, and with `k <- 5000L; m <- n <- 2L; A <- matrix(1, k, k); B <- matrix(1, k, n); tB <- t(B); microbenchmark::microbenchmark(tB %*% A, A %*% B)` I still get similar ~50% slower for `tB %*% A` each time — Roasty247, Oct 23 '22 at 14:52
Actually...should the simplification of OP to eliminate the `t()` not be `tB %*% (A %*% B)` and `(tB %*% A) %*% B` because `tB %*% A` ≠ `A %*% B`? — Roasty247, Oct 23 '22 at 15:03
I think the simplification just finds the bottleneck of the multiplication procedure which causes the time discrepancy, the ignored second step makes sure the two sides are equivalent but doesn't influence this time discrepancy. — dcsuka, Oct 23 '22 at 17:07
Just like @user20650 I cannot reproduce this. I am also using OpenBLAS. — Ralf Stubner, Oct 24 '22 at 17:37

GKi · Accepted Answer · 2022-10-25T19:23:10.010

It looks like this is a matter of the implementation and in which order the elements are accessed. See Why does the order of the loops affect performance when iterating over a 2D array?.

When doing the same but changing the order of loops the timing is different.

Rcpp::cppFunction("NumericMatrix mm(const NumericMatrix& A, const NumericMatrix& B) {
int M = A.nrow();
int N = A.ncol();
int P = B.ncol();
NumericMatrix res(M, P);
for (int n=0; n<N; ++n) {  //Loop n, p, m
  for (int p=0; p<P; ++p) {
    for (int m=0; m<M; ++m) {
      res[m+p*M] += A[m+M*n] * B[p*N+n];
    }
  }
}
return res;}")

Rcpp::cppFunction("NumericMatrix mm2(const NumericMatrix& A, const NumericMatrix& B) {
int M = A.nrow();
int N = A.ncol();
int P = B.ncol();
NumericMatrix res(M, P);
for (int m=0; m<M; ++m) {  //Loop m, p, n
  for (int p=0; p<P; ++p) {
     for (int n=0; n<N; ++n) {
      res[m+p*M] += A[m+M*n] * B[p*N+n];
    }
  }
}
return res;}")

k <- 5000L; m <- n <- 2L;
A <- matrix(rnorm(k * k), k, k);
B <- matrix(rnorm(k * n), k, n);
tB <- t(B);

met <- alist("(tB*A)B"     = tB %*% A %*% B,
             "tB(A*B)"     = tB %*% (A %*% B),
             "mm (tB*A)B"  = mm(mm(tB, A), B),
             "mm tB(A*B)"  = mm(tB, mm(A, B)),
             "mm2 (tB*A)B" = mm2(mm2(tB, A), B),
             "mm2 tB(A*B)" = mm2(tB, mm2(A, B)),
             "cp(B,A)B"    = crossprod(B, A) %*% B,
             "cp(B,A*B)"   = crossprod(B, A %*% B) )

bench::mark(exprs = met)
#  expression     min median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result  
#  <bch:expr>  <bch:> <bch:>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>  
#1 (tB*A)B     79.5ms 80.1ms    12.5 78.17KB       0     7     0   562ms <dbl[…]>
#2 tB(A*B)     33.8ms 34.4ms    28.4 78.17KB       0    15     0   528ms <dbl[…]>
#3 mm (tB*A)B  61.9ms 62.5ms    15.9  3.85MB       0     8     0   502ms <dbl[…]>
#4 mm tB(A*B)  19.9ms 20.7ms    48.1 83.16KB       0    25     0   520ms <dbl[…]>
#5 mm2 (tB*A)B 35.9ms 39.4ms    25.8 87.29KB       0    13     0   504ms <dbl[…]>
#6 mm2 tB(A*B) 47.8ms 48.1ms    20.6 83.16KB       0    11     0   535ms <dbl[…]>
#7 cp(B,A)B    44.1ms 44.5ms    22.4 80.42KB       0    12     0   536ms <dbl[…]>
#8 cp(B,A*B)   34.1ms 36.5ms    27.1 78.17KB       0    14     0   516ms <dbl[…]>

microbenchmark::microbenchmark(list = met)
#Unit: milliseconds
#        expr      min       lq     mean   median       uq      max neval
#     (tB*A)B 77.09484 77.86891 79.09483 78.44832 80.08971 87.05563   100
#     tB(A*B) 33.63306 34.22562 36.08482 35.14064 36.64080 51.39962   100
#  mm (tB*A)B 62.05235 64.14361 66.54568 65.16927 67.98617 75.96242   100
#  mm tB(A*B) 19.67066 20.28369 20.83781 20.53820 21.19940 23.64119   100
# mm2 (tB*A)B 35.31290 35.70006 36.62846 36.10282 37.41669 40.47473   100
# mm2 tB(A*B) 48.16574 49.70702 51.55844 50.26292 52.46479 67.44558   100
#    cp(B,A)B 43.18166 44.01366 45.28434 44.71301 46.41521 48.97891   100
#   cp(B,A*B) 33.62158 34.47070 35.84743 35.11853 36.55979 48.89021   100

Faster evaluation of matrix multiplication from right to left

1 Answers1

Linked