I have briefly covered this issue in several of my answers to statistical computing questions. The follow-up part of this answer of mine is relatively comprehensive. Note that my discussion there, as well as the following, really assumes that you know what BLAS is, especially what the level-3 BLAS routines dgemm and dsyrk are.
My answer here will provide some evidence and benchmarking to assure you of my argument over there. Additionally, an explanation of Douglas Bates' comment will be given.
What do crossprod and "%*%" really do?
Let's check the source code for both routines. The C-level source code for the R base package is largely found under
R-release/src/main
In particular, matrix operations are defined in
R-release/src/main/array.c
Now, "%*%" matches the C routine matprod, which calls dgemm with transa = "N" and transb = "N"; crossprod is readily seen to be a composite function, matching symcrossprod, crossprod, symtcrossprod and tcrossprod (counterparts for complex matrices, like ccrossprod, are not listed here).
You should now understand that crossprod avoids all explicit matrix transposition. crossprod(A, B) is cheaper than t(A) %*% B, as the latter needs an explicit transposition of A. In the following, I will refer to this as the transposition overhead.
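As a quick sanity check (a minimal sketch with small, arbitrary random matrices), you can confirm that the two forms compute the same thing, so any difference is purely in how the work is scheduled:
set.seed(1)
A_small <- matrix(runif(6), nrow = 3)   # a 3 x 2 example matrix
B_small <- matrix(runif(9), nrow = 3)   # a 3 x 3 example matrix
# crossprod(A_small, B_small) computes t(A_small) %*% B_small without forming the transpose
all.equal(crossprod(A_small, B_small), t(A_small) %*% B_small)   # should be TRUE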
R level profiling can expose this overhead. Consider the following example:
A <- matrix(runif(1000 * 1000), 1000)
B <- matrix(runif(1000 * 1000), 1000)
Rprof("brutal.out")
result <- t.default(A) %*% B
Rprof(NULL)
summaryRprof("brutal.out")
Rprof("smart.out")
result <- crossprod(A, B)
Rprof(NULL)
summaryRprof("smart.out")
Note how t.default shows up in the profiling result for the first case.
Profiling also gives the CPU time of the execution. You will see that both seem to have taken the same amount of time, as the overhead is insignificant. Now, I will show you when the overhead becomes a pain.
When is the transposition overhead significant?
Let A be a k * m matrix and B a k * n matrix; then the matrix multiplication A'B (' means transpose) has a FLOP count (number of floating-point additions and multiplications) of 2mnk. If you do t(A) %*% B, the transposition overhead is mk (the number of elements in A), so the ratio is
useful computation : overhead = 2n : 1
Unless n is big, the transposition overhead cannot be amortized in the actual matrix multiplication.
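To see the ratio in concrete numbers, here is a small sketch (the sizes k and m are arbitrary) tabulating the useful work against the transposition overhead for a few values of n:
k <- 2000; m <- 2000                 # arbitrary example sizes
for (n in c(1, 10, 1000)) {
  flops    <- 2 * m * n * k          # useful FLOPs in A'B
  overhead <- m * k                  # elements of A touched by the explicit transpose
  cat("n =", n, "-> useful : overhead =", flops / overhead, ": 1\n")
}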
The most extreme case is n = 1, i.e., you have a single-column matrix (or a vector) for B. Consider a benchmark:
library(microbenchmark)
A <- matrix(runif(2000 * 2000), 2000)
x <- runif(2000)
microbenchmark(t.default(A) %*% x, crossprod(A, x))
You will see that crossprod is several times faster!
When is "%*%"
structurally inferior?
As mentioned in my linked answer (as well as in Bates' notes with benchmarking results), if you compute A'A, crossprod is sure to be 2 times faster, because the single-argument form calls the symmetric rank-k routine dsyrk, which only computes one triangle of the result.
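A quick benchmark along these lines (a sketch with an arbitrary 2000 x 2000 matrix; the exact factor depends on your BLAS) could look like:
library(microbenchmark)
A <- matrix(runif(2000 * 2000), 2000)
# crossprod(A) computes A'A, exploiting the symmetry of the result
microbenchmark(t.default(A) %*% A, crossprod(A), times = 10)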
What if you have sparse matrices?
Having sparse matrices does not change the fundamental conclusion above. The R package Matrix for setting up sparse matrices also has sparse computation methods for "%*%" and crossprod, so you should still expect crossprod to be slightly faster.
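For example, a sketch using rsparsematrix from the Matrix package to generate random sparse matrices (sizes and density here are arbitrary):
library(Matrix)
library(microbenchmark)
A_sp <- rsparsematrix(2000, 2000, density = 0.01)   # random sparse 2000 x 2000 matrices
B_sp <- rsparsematrix(2000, 2000, density = 0.01)
# Matrix dispatches its own sparse methods for both operators
microbenchmark(t(A_sp) %*% B_sp, crossprod(A_sp, B_sp), times = 10)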
So what does Bates' comment on BLAS "as of summer 2007" mean?
This has nothing to do with sparse matrices. BLAS is strictly intended for dense numerical linear algebra.
What it relates to is the difference in the loop variants used inside Netlib's F77 reference dgemm. There are two loop variants for scheduling the matrix-matrix multiplication op(A) * op(B) (the * here denotes matrix multiplication, not the elementwise product!), and the variant is completely decided by the transpose setting of op(A):
- for op(A) = A', the "inner product" version is used in the innermost loop, in which case no zero detection is possible;
- for op(A) = A, the "AXPY" version is used, and any zero in op(B) can be excluded from the computation.
Now think about how R calls dgemm. The first case corresponds to crossprod, while the second case corresponds to "%*%" (as well as tcrossprod).
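To make the two variants concrete, here is a rough R mock-up of the two scheduling patterns (this is only a pedagogical sketch of what the reference Fortran loops do, not the actual dgemm code):
# "inner product" variant, as used for C = t(A) %*% B:
# each C[i, j] is a full dot product, so every element of B is read and no zero can be skipped
inner_product_matprod <- function(A, B) {
  m <- ncol(A); n <- ncol(B)
  C <- matrix(0, m, n)
  for (j in seq_len(n))
    for (i in seq_len(m))
      C[i, j] <- sum(A[, i] * B[, j])
  C
}
# "AXPY" variant, as used for C = A %*% B:
# column j of C is built from columns of A scaled by B[l, j], so a zero B[l, j] skips a whole update
axpy_matprod <- function(A, B) {
  m <- nrow(A); n <- ncol(B); k <- ncol(A)
  C <- matrix(0, m, n)
  for (j in seq_len(n))
    for (l in seq_len(k))
      if (B[l, j] != 0)               # zero detection
        C[, j] <- C[, j] + A[, l] * B[l, j]
  C
}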
In this respect, if your B matrix has a lot of zeros while it is still in dense matrix format, then t(A) %*% B will be faster than crossprod(A, B), simply because the loop variant for the former is more efficient.
The most illuminating example is when B is a diagonal matrix or a banded matrix. Let's consider a banded matrix (actually a symmetric tri-diagonal matrix here):
B_diag <- diag(runif(1000))
B_subdiag <- rbind(0, cbind(diag(runif(999)), 0))
B <- B_diag + B_subdiag + t(B_subdiag)
Now let A still be a full dense matrix:
A <- matrix(runif(1000 * 1000), 1000)
Then compare
library(microbenchmark)
microbenchmark(t.default(A) %*% B, crossprod(A, B))
You will see that "%*%"
is much faster!
I guess an even better example is when B is a triangular matrix, which is fairly common in practice. A triangular matrix is not seen as a sparse matrix (and should not be stored as a sparse one, either).
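For instance, a sketch where B is a dense lower-triangular matrix (just a random matrix with its upper triangle zeroed out):
library(microbenchmark)
A <- matrix(runif(1000 * 1000), 1000)
B_tri <- matrix(runif(1000 * 1000), 1000)
B_tri[upper.tri(B_tri)] <- 0            # lower triangular, but still stored as a dense matrix
microbenchmark(t.default(A) %*% B_tri, crossprod(A, B_tri), times = 10)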
Caution: If you are using R with an optimized BLAS library, like OpenBLAS or Intel MKL, you will still see that crossprod is faster. However, this is really because the blocking and caching strategy in any optimized BLAS library destroys the loop-variant scheduling pattern of Netlib's F77 reference BLAS. As a result, any "zero detection" is impossible. So what you will observe is that, for this particular example, the F77 reference BLAS is even faster than an optimized BLAS.
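If you are unsure which case you are in, recent versions of R (3.4.0 and later) report the BLAS and LAPACK libraries in use near the top of the sessionInfo() output:
sessionInfo()   # look for the "Matrix products" / BLAS / LAPACK lines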