In my line of work (econometrics/statistics), I frequently have to multiply matrices of different sizes and then perform additional operations on the resulting matrix. I have always relied on bsxfun() to vectorize the code, which in general I find to be more efficient than repmat(). But what I don't understand is why the performance of bsxfun() can sometimes be very different when expanding the matrices along different dimensions.
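(As a toy illustration, not part of the benchmark: both lines below add a row vector to every row of a matrix, but bsxfun() expands r virtually while repmat() materializes an explicit copy first. A and r here are made-up inputs.)

A = rand(1000, 50);
r = rand(1, 50);
B1 = bsxfun(@plus, A, r);            % virtual expansion, no explicit copy of r
B2 = A + repmat(r, size(A, 1), 1);   % materializes a 1000 x 50 copy of r first
isequal(B1, B2)                      % returns true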
Consider this specific example:
x = ones(j, k, m);                   % j x k regressor matrix for each of the m markets (all ones here)
beta = rand(k, m, s);                % s draws of the k x 1 coefficient vector, per market
exp_xBeta = zeros(j, m, s);          % preallocate the result
for im = 1 : m
    for is = 1 : s
        xBeta = x(:, :, im) * beta(:, im, is);   % j x 1
        exp_xBeta(:, im, is) = exp(xBeta);
    end
end
y = mean(exp_xBeta, 3);              % average over the s draws, size [ j m ]
Context:
We have data from m markets, and within each market we want to calculate the expectation of exp(X * beta), where X is a j x k matrix and beta is a k x 1 random vector. We compute this expectation by Monte Carlo integration: make s draws of beta, compute exp(X * beta) for each draw, and then take the mean. Typically we get data with m > k > j, and we use a very large s. In this example I simply let X be a matrix of ones.
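In symbols, we are approximating

$$\mathbb{E}\left[\exp(X\beta)\right] \approx \frac{1}{s}\sum_{i=1}^{s}\exp\left(X\beta^{(i)}\right),$$

where beta^(i) denotes the i-th draw.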
I did 3 versions of vectorization using bsxfun(), which differ in how X and beta are reshaped:
Vectorization 1
x1 = x; % size [ j k m 1 ]
beta1 = permute(beta, [4 1 2 3]); % size [ 1 k m s ]
tic
xBeta = bsxfun(@times, x1, beta1);
exp_xBeta = exp(sum(xBeta, 2));
y1 = permute(mean(exp_xBeta, 4), [1 3 2 4]); % size [ j m ]
time1 = toc;
Vectorization 2
x2 = permute(x, [4 1 2 3]); % size [ 1 j k m ]
beta2 = permute(beta, [3 4 1 2]); % size [ s 1 k m ]
tic
xBeta = bsxfun(@times, x2, beta2);
exp_xBeta = exp(sum(xBeta, 3));
y2 = permute(mean(exp_xBeta, 1), [2 4 1 3]); % size [ j m ]
time2 = toc;
Vectorization 3
x3 = permute(x, [2 1 3 4]); % size [ k j m 1 ]
beta3 = permute(beta, [1 4 2 3]); % size [ k 1 m s ]
tic
xBeta = bsxfun(@times, x3, beta3);
exp_xBeta = exp(sum(xBeta, 1));
y3 = permute(mean(exp_xBeta, 4), [2 3 1 4]); % size [ j m ]
time3 = toc;
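(All three should agree with the loop version up to floating-point noise; a quick sanity check, assuming y from the for-loop version is still in the workspace:)

max(abs(y1(:) - y(:)))
max(abs(y2(:) - y(:)))
max(abs(y3(:) - y(:)))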
And this is how they performed:
j = 5, k = 15, m = 100, s = 2000:
For-loop version took 0.7286 seconds.
Vectorized version 1 took 0.0735 seconds.
Vectorized version 2 took 0.0369 seconds.
Vectorized version 3 took 0.0503 seconds.
j = 10, k = 15, m = 150, s = 5000:
For-loop version took 2.7815 seconds.
Vectorized version 1 took 0.3565 seconds.
Vectorized version 2 took 0.2657 seconds.
Vectorized version 3 took 0.3433 seconds.
j = 15, k = 35, m = 150, s = 5000:
For-loop version took 3.4881 seconds.
Vectorized version 1 took 1.0687 seconds.
Vectorized version 2 took 0.8465 seconds.
Vectorized version 3 took 0.9414 seconds.
Why is version 2 consistently the fastest? Initially, I thought the performance advantage came from setting s as dimension 1, which MATLAB might be able to process faster since it stores data in column-major order. But MATLAB's profiler told me that the time taken to compute the mean was rather insignificant and was more or less the same across all 3 versions. MATLAB spent most of the time evaluating the lines with bsxfun(), and that is also where the run-time difference was biggest among the 3 versions.
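(A more robust way to time just those lines is timeit, which handles warm-up and repetition for you; a minimal sketch, assuming x1, beta1, etc. have already been built as above:)

f1 = @() bsxfun(@times, x1, beta1);
f2 = @() bsxfun(@times, x2, beta2);
f3 = @() bsxfun(@times, x3, beta3);
[timeit(f1), timeit(f2), timeit(f3)]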
Any thoughts on why version 1 is always the slowest and version 2 is always the fastest?
I've updated my test code here: Code
EDIT: An earlier version of this post was incorrect; beta should be of size (k, m, s).