5

Given a matrix A, I need to multiply with other n vectors Bi (i.e. i=1...n). The size of A can be like 5000x5000 and thus Bi like 5000x1.

If I evaluate the product in the following way:

for i=1:n
    product=A*Bi;
    % do something with product
end

The result is way (orders of magnitude) slower than computing the products like:

%assume that S is a matrix that contains the vectors Bi as columns, i.e. S(:,i)=Bi, then:
results=A*S;   %stores all the products in matrix form
% do something with results

The problem is that the number n of vectors Bi may be too big to be stored in memory, for example n=300000, so I need to use a loop approach where each time I evaluate the product, use it and then discard the vector Bi.

Why is such an approach so slow compared to direct multiplication, and are there ways to overcome this?

Adriaan
  • 17,741
  • 7
  • 42
  • 75
xanz
  • 221
  • 3
  • 10
  • 3
    Good read on this topic is [Why is MATLAB so fast in matrix multiplication?](http://stackoverflow.com/questions/6058139/why-is-matlab-so-fast-in-matrix-multiplication) – Adriaan Jan 08 '16 at 11:37
  • Seriously, mathworks should do a proper benchmark on this and print it in large neon green letters somewhere. This question have been asked a lot of times and is still being asked. Clearly the answers on the web are not good enough for some people, so why does not mathworks (with better insight in the source) try to do this? @xarz No pun for asking. If the answers on the web does not satisfy, there are apparently no good enough answer to the question. – patrik Jan 11 '16 at 15:33
  • @patrik Maybe you're right, but I looked on stackoverflow and I didn't find a topic dealing with this exact problem. By the way, if you can link here some references that deal with this exact problem they may be useful for future readers. Thank you. – xanz Jan 13 '16 at 18:43

4 Answers4

4

You could try loop over batches so for example

for i = 0:(n/k)-1
    product = A*S(:,(i*k+1):(i+1)*k)
end

And adjust k to find the best trade off of speed and memory for you.

MATLAB's loops are slow because it is an interpreted language. So it has to work out a lot of stuff on the fly. The loops are greatly improved these days because of the JIT compiler, but they are still slow compared to the built-in functions which are written and compiled in C. Further to that, they use really cutting edge super fast matrix multiplication algorithms, as compared with your fairly naive algorithm achieved by looping, which also aid in the speed-up you are experiencing.

Dan
  • 45,079
  • 17
  • 88
  • 157
  • 1
    You could also go parallel in this case, by simply using `parfor`. Given enough cores and complexity (or size) of the calculations that might be a significant speed boost. – Adriaan Jan 08 '16 at 11:38
  • 1
    @Adriaan probably worth adding that as another answer. Although if I'm not mistaken, the `*` operator is already parallelized so it's hard to predict what kind of speed up to expect and also it won't help with the memory constraints. – Dan Jan 08 '16 at 11:40
  • 1
    There is a bug in the slicing, it starts with k and has every k-th element twice. – Daniel Jan 08 '16 at 11:49
  • @Daniel ya I didn't actually work it out, was more of an illustration. I'll try correct it though – Dan Jan 08 '16 at 12:10
  • I used such an approach, and I think it's the best. I wrote a method that multiplies a batch of 50000 vectors Bi, and then stores the results to allow an external function to retreive the each product. When the buffer is empty it fills it again. Basically is what you suggested! – xanz Jan 08 '16 at 12:13
3

For simplicity my answer will assume a n-by-n square matrix A, but it is true for non squares as well.

Your loop approach uses matrix vector multiplication. The naive solution is also the best known, resulting in a runtime of O(n^2) which is repeated n times. You end up with a total runtime of O(n^3).

For matrix multiplication, there is a better approach. The best known algorithm only requires little less than O(n^2.4) runtime, making it much faster for large number.

You will achieve a better runtime when multiplying multiple vectors Bi at once using matrix multiplication. This will not achieve the performance of a pure matrix multiplication, but working with larger slices of b is probably the fastest memory efficient solution.

Some Code for the different discussed approaches:

n=5000;
k=100;
A=rand(n,n);
S=rand(n,n);
workers=matlabpool('size');
%for a parfor solution, the batch size must be smaller because multiple batches are stred in memory at once
kparallel=k/workers;
disp('simple loop:');
tic;
for i = 1:n
    product = A*S(:,n);
end
toc
disp('batched loop:');
tic;
for i = 1:(n/k)
    product = A*S(:,(i-1)*k+1:(i)*k);
end
toc
disp('batched parfor loop:');
tic;
parfor i = 1:(n/kparallel)
    product = A*S(:,(i-1)*kparallel+1:(i)*kparallel);
end
toc
disp('matrix multiplication:');
tic;
A*S;
toc
Daniel
  • 36,610
  • 3
  • 36
  • 69
1

In addition to @Dan's answer you might try going parallel, provided you have enough cores and large enough operations to make it profitable (see this answer for more details on memory consumption of parfor):

parfor ii = 0:(n/k)-1
    product = A*S(:,(ii*k+1):(ii+1)*k)
end

I can't see in the docs on mtimes (the * operator) whether it is implicitly multithreaded, but it's worth a shot I guess.

Community
  • 1
  • 1
Adriaan
  • 17,741
  • 7
  • 42
  • 75
  • 2
    Instead of using `parfor`, I would recommend to use larger batch sizes until the memory limit is reached. That will be faster. – Daniel Jan 08 '16 at 11:48
  • @xarz then what Daniel said is indeed the case; matrix multiplication is so fast that it's better to process in batches as large as possible than to go parallel. – Adriaan Jan 08 '16 at 13:41
0

In order to do the multiplications of each array with the matrix just multiplicate the matrix with one matrix that will have as columns the arrays you want.

So if you want to check this

if

size(a)=3,3

then

 a*b==horzcat(a*b(:,1),a*b(:,2),a*b(:,3)) 

is true

In this way you save a lot of time from loop

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459