
When exactly is crossprod(X,Y) preferred to t(X) %*% Y when X and Y are both matrices? The documentation says

Given matrices x and y as arguments, return a matrix cross-product. This is formally equivalent to (but usually slightly faster than) the call t(x) %*% y (crossprod) or x %*% t(y) (tcrossprod).

So when is it not faster? When searching online, I found several sources stating either that crossprod is generally preferred and should be used as the default (e.g. here), or that it depends and the results of comparisons are inconclusive. For example, Douglas Bates (2007) says that:

Note that in newer versions of R and the BLAS library (as of summer 2007), R’s %*% is able to detect the many zeros in mm and shortcut many operations, and is hence much faster for such a sparse matrix than crossprod which currently does not make use of such optimizations [...]

So when should I use %*% and when crossprod?

Tim
  • I think you pretty much answered it yourself. `crossprod` is C level, so it should generally be faster, while for a sparse matrix R seems to handle it in a more efficient way. – David Arenburg Mar 13 '17 at 21:01
  • @DavidArenburg but `%*%` is also `.Primitive()` and C level. So the answer is that `%*%` should be used only when expecting sparse matrices (or for code readability)? – Tim Mar 13 '17 at 21:17
  • But you need to transpose `X` in R first in order to use it. I agree that both of these functions should be pretty much the same as directly in C, but like I said, you already said it yourself: *This is formally equivalent to (but usually slightly faster than)*. – David Arenburg Mar 13 '17 at 21:22

1 Answer


I have briefly covered this issue in several of my answers to statistical computing questions. The follow-up part of this answer of mine is relatively comprehensive. Note that my discussion there, as well as the following, assumes that you know what BLAS is, and in particular what the level-3 BLAS routines dgemm and dsyrk are.

My answer here will provide some evidence and benchmarks to back up my argument over there. Additionally, an explanation of Douglas Bates' comment will be given.

What do crossprod and "%*%" really do?

Let's check the source code for both routines. The C-level source code for the R base package is largely found under

R-release/src/main

In particular, matrix operations are defined in

R-release/src/main/array.c

Now,

  • "%*%" matches to a C routine matprod, which calls dgemm with transa = "N" and transb = "N";
  • crossprod and tcrossprod are composite functions, matching to the C routines crossprod, symcrossprod, tcrossprod and symtcrossprod (counterparts for complex matrices, like ccrossprod, are not listed here); these call dgemm with a transposed first or second argument, or dsyrk for the symmetric single-argument case. A quick R-level check of how both interfaces dispatch is sketched below.
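Before going further into the C code, a quick check at the R prompt already shows that both interfaces hand the work straight to compiled code (an illustrative check only; the exact printed output depends on your R version):

get("%*%")   # a .Primitive, implemented directly in C
crossprod    # a thin wrapper: function (x, y = NULL) .Internal(crossprod(x, y))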

You should now understand that crossprod avoids all explicit matrix transposition. crossprod(A, B) is cheaper than t(A) %*% B, as the latter needs an explicit matrix transposition for A. In the following, I will refer to this as transposition overhead.

R level profiling can expose this overhead. Consider the following example:

A <- matrix(runif(1000 * 1000), 1000)
B <- matrix(runif(1000 * 1000), 1000)

Rprof("brutal.out")
result <- t.default(A) %*% B
Rprof(NULL)
summaryRprof("brutal.out")

Rprof("smart.out")
result <- crossprod(A, B)
Rprof(NULL)
summaryRprof("smart.out")

Note how t.default comes into the profiling result for the first case.

Profiling also gives the CPU time of the execution. You will see that both seem to have taken about the same amount of time, as the overhead here is insignificant. Now I will show you when the overhead becomes a pain.

When is the transposition overhead significant?

Let A be a k * m matrix and B a k * n matrix. The matrix multiplication A'B (' means transpose) then has a FLOP count (the number of floating-point additions and multiplications) of 2mnk. If you do t(A) %*% B, the transposition adds an overhead of mk operations (one per element of A), so the ratio is

useful computation : overhead = 2n : 1

Unless n is big, the transposition overhead cannot be amortized over the actual matrix multiplication.

The most extreme case is n = 1, i.e., you have a single-column matrix (or a vector) for B. Consider a benchmark:

library(microbenchmark)
A <- matrix(runif(2000 * 2000), 2000)
x <- runif(2000)
microbenchmark(t.default(A) %*% x, crossprod(A, x))

You will see that crossprod is several times faster!

When is "%*%" structurally inferior?

As mentioned in my linked answer (as well as in Bates' notes with benchmarking results), if you compute A'A, crossprod is sure to be about 2 times faster, since the single-argument call crossprod(A) goes through the symmetric rank-k routine dsyrk, which only computes one triangle of the result, while t(A) %*% A pays the transposition and then a full dgemm product. A minimal benchmark sketch follows.
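Here is a minimal sketch of that case (a 2000 x 2000 A, as in the benchmark above; the exact speedup depends on your BLAS):

library(microbenchmark)
A <- matrix(runif(2000 * 2000), 2000)
## crossprod(A) computes A'A in one shot; t.default(A) %*% A first builds
## the transpose and then does the full matrix product
microbenchmark(t.default(A) %*% A, crossprod(A), times = 10)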

What if you have sparse matrices?

Having sparse matrices does not change the fundamental conclusion above. The R package Matrix, which is used for setting up sparse matrices, also has sparse computation methods for "%*%" and crossprod. So you should still expect crossprod to be slightly faster; a sketch is given below.
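A hedged sketch with the Matrix package (assuming it is installed; rsparsematrix simply generates a random sparse matrix with the given density of nonzeros):

library(Matrix)
library(microbenchmark)
## two random sparse 2000 x 2000 matrices with ~1% nonzero entries
As <- rsparsematrix(2000, 2000, density = 0.01)
Bs <- rsparsematrix(2000, 2000, density = 0.01)
## both calls stay in sparse format; crossprod skips the explicit t()
microbenchmark(t(As) %*% Bs, crossprod(As, Bs), times = 10)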

So what does Bates' comment on BLAS "as of summer 2007" mean?

This has nothing to do with sparse matrices. BLAS is strictly intended for dense numerical linear algebra.

What it relates to is the difference in the loop variants used inside Netlib's F77 reference dgemm. There are two loop variants for scheduling the matrix-matrix multiplication op(A) * op(B) (the * here denotes matrix multiplication, not the elementwise product!), and the variant is completely decided by the transpose setting of op(A):

  • for op(A) = A', the "inner product" version is used in the innermost loop, in which case no zero detection is possible;
  • for op(A) = A, the "AXPY" version is used, and any zero in op(B) can be excluded from the computation (a schematic R translation of the two variants is sketched right after this list).
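Purely as an illustration of these two Fortran loop variants (the variable names and the R rendering here are mine; the real work happens in compiled code):

## "inner product" variant, used when op(A) = A' (the crossprod case):
## C[i, j] = sum over l of A[l, i] * B[l, j]; every element of B is read,
## so zeros in B bring no savings
matmul_inner <- function(A, B) {
  C <- matrix(0, ncol(A), ncol(B))
  for (j in seq_len(ncol(B)))
    for (i in seq_len(ncol(A)))
      C[i, j] <- sum(A[, i] * B[, j])
  C
}

## "AXPY" variant, used when op(A) = A (the "%*%" case):
## column j of C is accumulated as C[, j] + B[l, j] * A[, l], and any
## B[l, j] == 0 lets the whole update be skipped
matmul_axpy <- function(A, B) {
  C <- matrix(0, nrow(A), ncol(B))
  for (j in seq_len(ncol(B)))
    for (l in seq_len(nrow(B)))
      if (B[l, j] != 0) C[, j] <- C[, j] + B[l, j] * A[, l]
  C
}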

Now think about how R calls dgemm. The first case corresponds to crossprod, while the second case to "%*%" (as well as tcrossprod).

In this respect, if your B matrix has a lot of zeros, while it is still in dense matrix format, then t(A) %*% B will be faster than crossprod(A, B), simply because the loop variant for the former is more efficient.

The most illuminating example is when B is a diagonal matrix, or a banded matrix. Let's consider a banded matrix (actually a symmetric tri-diagonal matrix here):

B_diag <- diag(runif(1000))
B_subdiag <- rbind(0, cbind(diag(runif(999)), 0))
B <- B_diag + B_subdiag + t(B_subdiag)

Now let A still be a full dense matrix

A <- matrix(runif(1000 * 1000), 1000)

Then compare

library(microbenchmark)
microbenchmark(t.default(A) %*% B, crossprod(A, B))

You will see that "%*%" is much faster!

I guess a better example is the case where B is a triangular matrix. This is fairly common in practice. A triangular matrix is not seen as a sparse matrix (and should not be stored as a sparse one, either). A quick sketch of this case follows.
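A minimal sketch of the triangular case (reusing the 1000 x 1000 dense A from above; lower.tri just marks one triangle of a dense matrix, so roughly half of B_tri's entries become exact zeros):

## dense storage, but about half the entries are exact zeros that the
## AXPY variant of "%*%" can skip
B_tri <- matrix(runif(1000 * 1000), 1000)
B_tri[lower.tri(B_tri)] <- 0
microbenchmark(t.default(A) %*% B_tri, crossprod(A, B_tri))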

Caution: if you are using R with an optimized BLAS library, like OpenBLAS or Intel MKL, you will still see that crossprod is faster. However, this is really because the blocking and caching strategy in optimized BLAS libraries destroys the loop-variant scheduling pattern of Netlib's F77 reference BLAS, so no "zero detection" is possible. As a result, what you will observe is that, for this particular example, the F77 reference BLAS is even faster than an optimized BLAS.

Zheyuan Li