
I implemented a parallel version of Jacobi's method for solving a linear system. While running some tests, I noticed that the time to execute the parallel function is very high compared to the time to execute the sequential one. This is strange, because Jacobi's method should be faster when executed with a parallel implementation.
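
For context, the iteration implemented below is the standard Jacobi splitting of $Ax = b$ with $D = \operatorname{diag}(A)$, which is exactly what the matrices B and C in the code compute:

$$x^{(k+1)} = \underbrace{\bigl(I - D^{-1}A\bigr)}_{B}\,x^{(k)} + \underbrace{D^{-1}b}_{C}$$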

I think I'm doing something wrong in the code:

function [x,niter,resrel] = Parallel_Jacobi(A,b,TOL,MAXITER)
% Parallel Jacobi iteration x_{k+1} = B*x_k + C, using spmd and codistributed arrays.
[n, m] = size(A);
D = 1./spdiags(A,0);             % inverse of the diagonal of A
B = speye(n) - A./spdiags(A,0);  % iteration matrix B = I - D^(-1)*A
C = D.*b;                        % C = D^(-1)*b
x0 = sparse(n,1);                % sparse zero vector as initial guess
spmd
    % distribute the rows of B, C and the iterate over the workers
    cod_vett = codistributor1d(1,codistributor1d.unsetPartition,[n,1]);
    cod_mat  = codistributor1d(1,codistributor1d.unsetPartition,[n,m]);

    B = codistributed(B,cod_mat);
    C = codistributed(C,cod_vett);
    x = codistributed(B*x0 + C,cod_vett);

    Niter = 1;
    TOLX = TOL;
    while(norm(x-x0,Inf) > norm(x0,Inf)*TOLX && Niter < MAXITER)
        if(TOL*norm(x,Inf) > realmin)
            TOLX = norm(x,Inf)*TOL;
        else
            TOLX = realmin;
        end
        x0 = x;
        x = B*x0 + C;
        Niter = Niter+1;
    end
end
niter = Niter{1};                % Niter is a Composite outside spmd
x = gather(x);
resrel = norm(b - A*x)/norm(b);  % relative residual of the returned solution
end

Below are the tests:

%sequential Jacobi
format long;
A = gallery('poisson',20);
tic;
x= jacobi(A,ones(400,1),1e-6,2000000);
toc;
Elapsed time is 0.009054 seconds.
%parallel Jacobi
format long;
A = gallery('poisson',20);
tic;
x= Parallel_Jacobi(A,ones(400,1),1e-6,2000000);
toc;
Elapsed time is 11.484130 seconds.

I timed the parpool function with 1, 2, 3 and 4 workers (I have a quad-core processor) and the results are the following:

%Test
format long;
A = gallery('poisson',20);
delete(gcp('nocreate'));
tic
%parpool(1/2/3/4) means that I executed 4 tests that differ only in the
%argument of the function: first parpool(1), then parpool(2), and so on.
parpool(1/2/3/4);
toc
tic;
x= Parallel_Jacobi(A,ones(400,1),1e-6,2000000);
toc;

4 workers: parpool=13.322899 seconds, function=23.772271 

3 workers: parpool=10.911769 seconds, function=16.402633 

2 workers: parpool=9.371729 seconds, function=12.945154 

1 worker: parpool=8.460357 seconds, function=7.982958

The fewer workers, the better the time, which, as @Adriaan said, is likely due to overhead.
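
To see how much of this overhead is communication rather than pool startup, here is a rough sketch (assuming a pool is already open) that times a single distributed sparse matrix-vector product against the local equivalent:

% Rough overhead check: compares one local sparse mat-vec with the same
% product on distributed arrays, so only the per-operation communication
% cost is measured, not the pool startup time.
A = gallery('poisson',20);
v = ones(400,1);

tLocal = timeit(@() A*v);          % local sparse multiply

Ad = distributed(A);               % spread A over the workers of the open pool
vd = distributed(v);
tDist = timeit(@() gather(Ad*vd)); % distributed multiply plus gather

fprintf('local: %.3g s, distributed: %.3g s\n', tLocal, tDist);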

Does this mean that, in this case, the sequential function is always faster than the parallel function? Or is there a better way to implement the parallel one?

In this question it is said that the performance in parallel is better when the number of iterations is high. In my case, with this test, there are only 32 iterations.
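
As a sanity check (a hypothetical test with a larger grid than the one I used), the iteration count can be pushed up by enlarging the Poisson problem:

% Hypothetical check with a larger problem: the iteration count of Jacobi
% on gallery('poisson',N) grows quickly with N, which is the regime where
% the parallel overhead could start to amortise.
A = gallery('poisson',100);            % 10000 x 10000 sparse system
b = ones(size(A,1),1);
[x, niter] = jacobi(A, b, 1e-6, 2e6);  % sequential version shown below
fprintf('iterations: %d\n', niter);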

The sequential implementation of Jacobi's method is this:

function [x,niter,resrel] = jacobi(A,b,TOL,MAXITER)
% Sequential Jacobi iteration x_{k+1} = B*x_k + C.
n = size(A,1);
D = 1./spdiags(A,0);             % inverse of the diagonal of A
B = speye(n) - A./spdiags(A,0);  % iteration matrix B = I - D^(-1)*A
C = D.*b;                        % C = D^(-1)*b

x0 = sparse(n,1);                % sparse zero vector as initial guess
x = B*x0 + C;
Niter = 1;
TOLX = TOL;

while(norm(x-x0,Inf) > norm(x0,Inf)*TOLX && Niter < MAXITER)
    if(TOL*norm(x,Inf) > realmin)
        TOLX = norm(x,Inf)*TOL;
    else
        TOLX = realmin;
    end

    x0 = x;
    x = B*x0 + C;

    Niter = Niter+1;
end
niter = Niter;                   % number of iterations performed
resrel = norm(b - A*x)/norm(b);  % relative residual of the returned solution
end

I timed the code with the timeit function and the results are these (the inputs are the same as in the previous tests):

4 workers: 11.693473075964102

3 workers: 9.221281335264003

2 workers: 9.150417240778545

1 worker: 6.047181992020434

sequential: 0.002893932969688
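
For reference, a minimal sketch of how such a timeit comparison can be set up (assuming the pool is opened before timing, so its startup cost is excluded):

% Minimal timing sketch: open the pool first so that only the solver
% itself is measured, then compare the two implementations with timeit.
A = gallery('poisson',20);
b = ones(400,1);

tSeq = timeit(@() jacobi(A, b, 1e-6, 2e6));

delete(gcp('nocreate'));    % make sure the pool has the requested size
parpool(4);                 % repeated with 1, 2 and 3 workers as well
tPar = timeit(@() Parallel_Jacobi(A, b, 1e-6, 2e6));

fprintf('sequential: %.6f s, parallel: %.6f s\n', tSeq, tPar);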

sergio zavota
  • How many parallel pools are you creating? Are you creating them implicitly? You may be timing how long the parpool takes to start up, together with the computational time. – Ander Biguri May 21 '19 at 09:55
  • You might want to read [this question](https://stackoverflow.com/q/32146555/5211833) and the answers thereon as general background information, because parallelisation is not a magic wand (the question is about `parfor`; `spmd`, however, follows the same general principles) – Adriaan May 21 '19 at 09:57
  • What is the timing when you have already opened a parallel pool? Seeing the 1000x difference in execution time, it makes me think that A) you are opening your pool within the parallel function (takes ~10 seconds usually) and B) the execution time in serial is already so fast that you are essentially overhead limited (see the question I linked in my previous comment) – Adriaan May 21 '19 at 10:01
    So you already discovered that indeed the function is much slower in parallel (I still think there's something off with your timing, since execution time should go down when using more workers). Why not stick to the serial one? We can't help much more, because you haven't shown us the serial implementation. Also, `tic/toc` is not recommended for proper timing; use [`timeit()`](https://mathworks.com/help/matlab/ref/timeit.html) instead. – Adriaan May 21 '19 at 14:39
  • My best guess is that none of this has to do with timing and parallel pools and such. I think your code is wrong and it's not doing what you think it's doing. – Ander Biguri May 21 '19 at 14:45
  • There is not this much overhead in starting a parallel pool. The overhead, I think, is because you are spreading your matrix across a pool of workers, but to do the computations each worker needs all of the data, and so there is a lot of overhead in communicating computational results between the workers. The MATLAB Parallel Computing Toolbox does not use shared-memory parallelism. – Cris Luengo May 21 '19 at 14:47
  • `codistributed` and friends are intended for working with huge matrices that do not fit in memory. You can distribute the matrix over a pool of workers (== different computers) to get the work done. It is not about speeding up computation, it is about memory. – Cris Luengo May 21 '19 at 14:48

0 Answers