
I have an n-by-n matrix A, an n-by-1 vector of coefficients k, and a 1-by-n vector called row. My goal is to subtract row, weighted by the ith coefficient of k, from the ith row of A. I have three questions: Why does my for-loop outperform the built-in matrix multiply? In general, what explains each method's advantage over the next? And is there a better method than the three I came up with?

% Define row & coefficients used in each method
k = (1:1000).';
row = 1:1000;

% Method 1 (matrix multiply) ~15 seconds
A = magic(1e3);
tic;
for z = 1:1e3
    A = A - k*row;
end
toc;
% Method 2 (for loop) ~11 seconds
B = magic(1e3);
tic;
for z = 1:1e3
    for cr = 1:1000
        B(cr,:) = B(cr,:) - k(cr)*row;
    end
end
toc;
% method 3 (bsxfun) ~ 4 seconds
C = magic(1e3);
tic;
for z = 1:1e3
    C = C - bsxfun(@times, k, row);
end
toc

isequal(A,B)
isequal(A,C)

Note: I am doing these row subtractions inside a larger algorithm. I simplified the code into this toy test case, but the crux of the computation is still present. Also, to avoid confusion, the outer loop over z is only there to make the timings large enough to measure.

user904963
  • On my machine the third option is the fastest – Andrey Rubshtein Sep 23 '12 at 14:52
  • Yes, same here. But the for loop is faster than method 1, which uses matrix multiply, right? I thought I was pretty slick doing that matrix multiply method, but it turned out to be the worst method out of the bunch! – user904963 Sep 23 '12 at 17:10

1 Answer


Short answer: a faster version of your code is this:

tic;
for z = 1:1e3
    for cr = 1:1000
        B(:,cr) = B(:,cr) - k*row(cr);
    end
end
toc;

You might want to have a look at my previous answer to this question. In short, MATLAB is column-major: columns are contiguous in memory. Your original loop iterated over rows, so each slice B(cr,:) strides through memory, which is inefficient; iterating over columns keeps the accesses contiguous.
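A quick way to see the column-major effect in isolation (a sketch; the matrix size and absolute timings are illustrative and will vary by machine):

```matlab
% Sum a large matrix row-by-row vs column-by-column.
% Row slices M(r,:) stride through memory; column slices M(:,c) are contiguous.
M = rand(4000);

tic;
s1 = 0;
for r = 1:4000
    s1 = s1 + sum(M(r,:));   % strided access
end
toc;

tic;
s2 = 0;
for c = 1:4000
    s2 = s2 + sum(M(:,c));   % contiguous access
end
toc;
```

The column loop is typically noticeably faster, for the same reason the B(:,cr) version above beats the B(cr,:) version.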

Execution times on my computer:

% A - k*row
Elapsed time is 4.370238 seconds.
% B(cr,:) = B(cr,:) - k(cr)*row;
Elapsed time is 9.537338 seconds.
% C = C - bsxfun(@times, k, row);
Elapsed time is 3.039836 seconds.
% B(:,cr) = B(:,cr) - k*row(cr);
Elapsed time is 2.028186 seconds.

Explanation. Your first version is not a general matrix multiply, but an outer product of two vectors, which results in a matrix of size 1000 x 1000. The whole computation is a BLAS2 rank-1 update (A = alpha*x*y' + A, the GER routine). The problem most likely is that MATLAB does not recognize it as such and instead runs the code as written, explicitly executing all operations, including forming k*row. This is exactly the problem with this solution: the outer product has to allocate additional memory equal in size to your matrix, which itself takes time. Consider this:

  • memory allocation - since we do not know how MATLAB manages memory allocation, this could mean a lot of overhead.
  • read vectors k, row
  • writing of the result matrix (KR)
  • reading the KR to subtract it from A
  • reading and writing of A

Two 1000x1000 double matrices take 16 MB - I doubt you have that much cache. That is why this version is not the best, and it can actually be slower than the 'memory inefficient' loop, depending on the available memory bandwidth and CPU cache size.

There is no need to allocate the KR matrix and store the values explicitly in memory - you can compute the needed products in a loop. Hence, you effectively halve the memory bandwidth requirements and remove the memory allocation overhead!

  • read vectors k, row
  • read and write A

Assuming that one vector fits into the cache (which it does - 1000*8 bytes is not much) you read k and row from the memory only once. Now the algorithm with the loop over columns makes perfect sense (this is likely how BLAS implements this computation)

  • read k and row and keep them in the cache
  • stream A with full memory bandwidth to the CPU, subtract k*row product, stream back to memory

Now the final efficiency considerations. Take my loop structure. In every iteration we read and write A, and read the vectors. That is 16MB of data moved per iteration. 1000 iterations gives 16 GB of data moved in total. Two seconds required to compute the result gives 8GB/s effective memory bandwidth. My system has 16GB/s stream bandwidth when using 2 CPU cores, around 11-12 GB/s when using one. Hence, this sequential loop runs at 60-70% efficiency on one core. Not bad, considering this is a MATLAB loop :)
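The bandwidth figure above works out as follows, using the 2.028 s timing measured earlier (a back-of-the-envelope sketch):

```matlab
% Per outer iteration, A (1000x1000 doubles) is read once and written once;
% the vectors are negligible once cached.
bytesPerIter = 2 * 1000 * 1000 * 8;     % 16 MB moved per iteration
totalBytes   = 1000 * bytesPerIter;     % 16 GB over the 1000 iterations
bandwidth    = totalBytes / 2.028186    % ~7.9e9 bytes/s effective
```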

Note also that, at least on my computer, the column-based loop version is a bit more than twice as fast as the A - k*row implementation (2 s vs 4.37 s). This strongly indicates that k*row is indeed explicitly executed and constructed by MATLAB, so the total memory traffic is twice that of the looped version - hence roughly half the performance.

Edit You can still try to do it like in the first algorithm by calling the corresponding BLAS function directly. Have a look at this FEX contribution. It allows you to call BLAS and LAPACK functions directly from MATLAB. It might be useful.
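As an aside that postdates this thread: from MATLAB R2016b onward, implicit expansion lets you drop the bsxfun call entirely. Note that the temporary outer product is still formed, so expect bsxfun-like rather than loop-like timings:

```matlab
% k is 1000x1 and row is 1x1000, so k .* row expands element-wise
% to the 1000x1000 outer product, same as bsxfun(@times, k, row).
C = C - k .* row;
```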

angainor
  • Am I reading it right that the for loop was slower than A-k*row? On my computer, the A-k*row took 15 seconds (slowest of all as commented)! I'm using 32-bit 2012a. What version are you using? Also, the code I am working on is an assignment to code LU decomposition for educational purposes (i.e. Gauss-elimination), so I probably can't (comfortably) use the column technique. But thank you for the new perspective. – user904963 Sep 23 '12 at 18:33
  • 1000x1000 elements really ought not to be 16GB... did you mean MB? – tmpearce Sep 23 '12 at 18:46