This might disappoint you, but at the R level this is already the best you can get without writing some C code yourself. The problem is that by doing `rowSums(A * B)`, you are effectively doing
C <- A * B
rowSums(C)
The first line performs a full scan of three large tall-thin matrices (reading `A`, reading `B`, and writing the product `C`), while the second line performs a full scan of one more (reading `C`). So altogether, we equivalently scan a tall-thin matrix four times, which makes the computation memory intensive.
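To see the hidden temporary, note that `A * B` materializes a full `n * p` matrix before `rowSums` ever runs. A quick check (the dimensions `n = 1e+6`, `p = 10` here are made up purely for illustration):

n <- 1e+6
p <- 10
A <- matrix(runif(n * p), n, p)
B <- matrix(runif(n * p), n, p)
print(object.size(A * B), units = "Mb")
# 76.3 Mb  (the temporary is as large as A itself)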
In fact, for such an operation, the optimal algorithm only needs the equivalent of scanning an `n * p` tall-thin matrix twice (reading `A` and `B` once each), by doing a row-wise cross product:
rowsum <- numeric(n)
for j = 1, 2, ..., p
  rowsum += A[,j] * B[,j]
In this way, we also avoid generating the matrix `C`. Note that the above is just pseudo-code rather than valid R code (or even C code), but the idea is clear, and we want to program this in C.
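Before doing so, it is worth checking the idea at the R level. Below is a literal translation of the pseudo-code (a sketch only; the helper name `row_cross` is made up). An interpreted loop like this will not actually beat the vectorized `rowSums(A * B)`, because `A[, j]` and `B[, j]` still allocate temporary length-`n` vectors on every iteration; the memory saving only materializes once the loop body runs in C.

row_cross <- function (A, B) {
  n <- nrow(A)
  rowsum <- numeric(n)
  # accumulate one column pair at a time; no n x p intermediate matrix
  for (j in seq_len(ncol(A)))
    rowsum <- rowsum + A[, j] * B[, j]
  rowsum
}
# row_cross(A, B) and rowSums(A * B) give the same result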
An analogy to your situation is the speed difference between `sum(x * y)` and `crossprod(x, y)`, assuming `x` and `y` are large vectors of the same length.
x <- runif(1e+7)
y <- runif(1e+7)
system.time(sum(x * y))
# user system elapsed
# 0.124 0.032 0.158
system.time(crossprod(x, y))
# user system elapsed
# 0.036 0.000 0.056
In the first case, we scan a long vector four times (reading `x`, reading `y`, writing `x * y`, then reading it back for the sum), while in the second case we only scan it twice. (Note that `crossprod(x, y)` returns a `1 x 1` matrix rather than a scalar, but the value is the same.)
Relevance in statistical computing
`rowSums(A * B)` is in fact an efficient evaluation of `diag(tcrossprod(A, B))`, commonly seen in regression computations for point-wise prediction variance. For example, in ordinary least squares regression with the thin `Q` matrix from the QR factorization of the model matrix, the point-wise variance of the fitted values is (up to the residual variance factor) `diag(tcrossprod(Q))`, which is more efficiently computed as `rowSums(Q ^ 2)`. Yet this is still not the fastest evaluation, for the reasons already explained.
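As a small self-contained check (the numbers here are made up purely for illustration), the two evaluations agree on the thin `Q` factor of a random model matrix:

set.seed(0)
X <- matrix(runif(2000 * 4), 2000, 4)  # a made-up 2000 x 4 model matrix
Q <- qr.Q(qr(X))                       # thin Q factor, also 2000 x 4
v1 <- diag(tcrossprod(Q))  # forms a full 2000 x 2000 matrix first
v2 <- rowSums(Q ^ 2)       # no n x n intermediate (though Q ^ 2 is still a temporary)
all.equal(v1, v2)
# [1] TRUE

The `rowSums` form is dramatically cheaper for large `n`, and the row-wise cross-product idea above would remove the remaining `Q ^ 2` temporary as well.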