Does anyone know how to speed up this simple code running in parallel (with parfor)?

Question

I do time-consuming simulations involving the following (simplified) code:

K=10^5; % large number
L=1000; % smaller number

a=rand(K,L);
b=rand(K,L);
c=rand(L,L);
d=zeros(K,L,L);

parfor m=1:L-1
    
    e=zeros(K,L);
    
    for n=m:L-1
        
        e(:,n+1)=e(:,n)+(n==m)+e(:,n).*a(:,n).*b(:,n)+a(:,1:n)*c(n,1:n)';
        
    end
    
    d(:,:,m)=e;
end

Since each worker requires matrices a and b and c, there is a large parallel overhead.

The overhead is smaller if I send each worker only the parts of the matrix b it needs (since the inner loop starts at m), but that doesn't make the code very much faster, I think.

Because of the large overhead, parfor is slower than the serial for-loop. As parfor iterations increase (increasing L), the sizes of a, b, and c also increase, and so does the overhead. Therefore, I do not expect the parfor loop to be faster even for large values of L. Or does anyone see it differently?

You probably should be using a tread-based parallel pool. See here: https://www.mathworks.com/help/parallel-computing/choose-between-thread-based-and-process-based-environments.html — Cris Luengo, Jun 09 '23 at 19:15
Thanks @cris . Matlab uses built-in multithreading, so it is not necessary to use a thread-based parallel pool on a local machine. Unfortunately, my cluster is running a matlab version that does not support a thread-based parallel pool (not until matlab 2020a). Thanks anyway! — Lena K., Jun 12 '23 at 14:30
I think you misunderstand. Yes, MATLAB uses multi-threading internally for most matrix operations, but additional parallelism is typically beneficial. If MATLAB is not using 100% of your CPU, adding parallelism is useful. Whether you do this with a thread-based or a process-based pool doesn't matter, it's using the same CPU resources. The differences between threads and processes are clearly explained in the linked document, processes have separate memory, and so data has to be copied around; threads share memory, no data is copied. — Cris Luengo, Jun 12 '23 at 14:44
Maybe using [`parallel.pool.Constant`](https://www.mathworks.com/help/parallel-computing/parallel.pool.constant.html) you can avoid repeatedly copying the same data. This one was added in R2015b according to the docs, I hope your version of MATLAB is not older than that! :) — Cris Luengo, Jun 12 '23 at 14:56
When I look at the load on the cores of my local computer for a serial job, it is always greater than for a parallel job because of the built-in multithreading. parallel.pool.Constant works well, though, and makes parallel jobs much faster. Thanks @CrisLuengo for the tip! — Lena K., Jun 13 '23 at 08:44

rahnema1 · Accepted Answer · 2023-06-10T04:15:46.000

1

There may be a performance gain using pre-computation:

tc = tril(c);
ac = a * tc.';
ab = a .* b;
for m=1:L-1
    e = zeros(K,L);
    for n=m:L-1
        e(:, n + 1) = e(:, n) + (n==m) + e(:, n) .* ab(:, n) + ac(:, n);
    end
    d(:,:,m) = e;
end

edited Jun 10 '23 at 04:15

answered Jun 10 '23 at 03:53

rahnema1

15,264
3
15
27

thanks, you mean parfor instead of for in the outer loop, right? I check if it is faster than the serial for-loop.. – Lena K. Jun 10 '23 at 07:37
I think using precomputation results in speedup both in for and parfor loops. – rahnema1 Jun 10 '23 at 08:23
Thanks @rahnema1, I was not aware that precomputation makes such a big difference! If the following is changed in the second summand `e(:,n+1)=e(:,n)+(n==m)+e(:,n).*a(:,n).*b(:,n)+(a(:,n)-a(:,1:n)).*c(n,1:n);` the following precomputations speed up the code enormously `ab = a .* b;` and `ac=(reshape(a,K,1,L)-a).*reshape(c,1,L,L);` with the result `e(:,n+1)=e(:,n)+(n==m)+e(:,n).*ab(:,n)+ac(:,n,1:n);` do you have a better idea? – Lena K. Jun 12 '23 at 14:41
1

At first I thought it must be possible to re-use `e` or parts of it for different `m`, but maybe it's not possible after all. – Cris Luengo Jun 12 '23 at 15:02
@LenaK. I think that will not work. The expression `(a(:,n)-a(:,1:n))` creates a matrix of size `[K x n]` that cannot be placed in a slice of size `[K x 1]`. – rahnema1 Jun 13 '23 at 03:39
@CrisLuengo You are right. The result of the recursion depends on the previous computations. – rahnema1 Jun 13 '23 at 03:41
True @rahnema1. My correct code is much more complicated and I did not simplify it correctly. This is how it should be correct: `e(:,n+1)=e(:,n)+(n==m) +e(:,n).*a(:,n).*b(:,n)+ sum(((a(:,n)-a(:,1:n)).*c(n,1:n)).*e(:,1:n),2);` becomes `e(:,n+1)=e(:,n)+(n==m)+e(:,n).*ab(:,n)+sum(squeeze(ac(:,n,1:n)).*e(:,1:n),2);` – Lena K. Jun 13 '23 at 08:49

Does anyone know how to speed up this simple code running in parallel (with parfor)?

1 Answers1