In my line of work (econometrics/statistics), I frequently have to multiply matrices of different sizes and then perform additional operations on the resulting matrix. I have always relied on bsxfun() to vectorize the code, which in general I find to be more efficient than repmat(). But what I don't understand is why the performance of bsxfun() can sometimes be very different when expanding the matrices along different dimensions.
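(As a toy illustration, not part of the benchmark: both lines below add a row vector to every row of a matrix, but bsxfun() expands r virtually while repmat() materializes an explicit copy first. A and r here are made-up inputs.)

A = rand(1000, 50);
r = rand(1, 50);
B1 = bsxfun(@plus, A, r);            % virtual expansion, no explicit copy of r
B2 = A + repmat(r, size(A, 1), 1);   % materializes a 1000 x 50 copy of r first
isequal(B1, B2)                      % returns true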
Consider this specific example:
x = ones(j, k, m);                   % j x k regressor matrix for each of the m markets (all ones here)
beta = rand(k, m, s);                % s draws of the k x 1 coefficient vector, per market
exp_xBeta = zeros(j, m, s);          % preallocate the result
for im = 1 : m
    for is = 1 : s
        xBeta = x(:, :, im) * beta(:, im, is);   % j x 1
        exp_xBeta(:, im, is) = exp(xBeta);
    end
end
y = mean(exp_xBeta, 3);              % average over the s draws, size [ j m ]
Context:
We have data from m markets, and within each market we want to calculate the expectation of exp(X * beta), where X is a j x k matrix and beta is a k x 1 random vector. We compute this expectation by Monte Carlo integration: make s draws of beta, compute exp(X * beta) for each draw, and then take the mean. Typically we get data with m > k > j, and we use a very large s. In this example I simply let X be a matrix of ones.
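In symbols, we are approximating

$$\mathbb{E}\left[\exp(X\beta)\right] \approx \frac{1}{s}\sum_{i=1}^{s}\exp\left(X\beta^{(i)}\right),$$

where beta^(i) denotes the i-th draw.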
I did 3 versions of vectorization using bsxfun(), which differ in how X and beta are reshaped:
Vectorization 1
x1 = x; % size [ j k m 1 ]
beta1 = permute(beta, [4 1 2 3]); % size [ 1 k m s ]
tic
xBeta = bsxfun(@times, x1, beta1);
exp_xBeta = exp(sum(xBeta, 2));
y1 = permute(mean(exp_xBeta, 4), [1 3 2 4]); % size [ j m ]
time1 = toc;
Vectorization 2
x2 = permute(x, [4 1 2 3]); % size [ 1 j k m ]
beta2 = permute(beta, [3 4 1 2]); % size [ s 1 k m ]
tic
xBeta = bsxfun(@times, x2, beta2);
exp_xBeta = exp(sum(xBeta, 3));
y2 = permute(mean(exp_xBeta, 1), [2 4 1 3]); % size [ j m ]
time2 = toc;
Vectorization 3
x3 = permute(x, [2 1 3 4]); % size [ k j m 1 ]
beta3 = permute(beta, [1 4 2 3]); % size [ k 1 m s ]
tic
xBeta = bsxfun(@times, x3, beta3);
exp_xBeta = exp(sum(xBeta, 1));
y3 = permute(mean(exp_xBeta, 4), [2 3 1 4]); % size [ j m ]
time3 = toc;
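(All three should agree with the loop version up to floating-point noise; a quick sanity check, assuming y from the for-loop version is still in the workspace:)

max(abs(y1(:) - y(:)))
max(abs(y2(:) - y(:)))
max(abs(y3(:) - y(:)))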
And this is how they performed:
j = 5, k = 15, m = 100, s = 2000:
For-loop version took 0.7286 seconds.
Vectorized version 1 took 0.0735 seconds.
Vectorized version 2 took 0.0369 seconds.
Vectorized version 3 took 0.0503 seconds.
j = 10, k = 15, m = 150, s = 5000:
For-loop version took 2.7815 seconds.
Vectorized version 1 took 0.3565 seconds.
Vectorized version 2 took 0.2657 seconds.
Vectorized version 3 took 0.3433 seconds.
j = 15, k = 35, m = 150, s = 5000:
For-loop version took 3.4881 seconds.
Vectorized version 1 took 1.0687 seconds.
Vectorized version 2 took 0.8465 seconds.
Vectorized version 3 took 0.9414 seconds.
Why is version 2 consistently the fastest? Initially, I thought the performance advantage came from setting s as dimension 1, which MATLAB might be able to process faster since it stores data in column-major order. But MATLAB's profiler told me that the time taken to compute the mean was rather insignificant and was more or less the same across all 3 versions. MATLAB spent most of the time evaluating the lines with bsxfun(), and that is also where the run-time difference was biggest among the 3 versions.
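(A more robust way to time just those lines is timeit, which handles warm-up and repetition for you; a minimal sketch, assuming x1, beta1, etc. have already been built as above:)

f1 = @() bsxfun(@times, x1, beta1);
f2 = @() bsxfun(@times, x2, beta2);
f3 = @() bsxfun(@times, x3, beta3);
[timeit(f1), timeit(f2), timeit(f3)]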
Any thoughts on why version 1 is always the slowest and version 2 is always the fastest?
I've updated my test code here: Code
EDIT: An earlier version of this post was incorrect; beta should be of size (k, m, s).