I implemented a parallel version of Jacobi's method for the resolution of a linear system. Doing some tests I noticed that the time to execute the function in parallel is very high compared to the time to execute the sequential function. This is strange because the Jacobi's method should be faster when executed with a parallel implementation.
I think I'm doing something wrong in the code:
function [x,niter,resrel] = Parallel_Jacobi(A,b,TOL,MAXITER)
[n, m] = size(A);
D = 1./spdiags(A,0);
B = speye(n)-A./spdiags(A,0);
C= D.*b;
x0=sparse(zeros(length(A),1));
spmd
cod_vett=codistributor1d(1,codistributor1d.unsetPartition,[n,1]);
cod_mat=codistributor1d(1,codistributor1d.unsetPartition,[n,m]);
B= codistributed(B,cod_mat);
C= codistributed(C,cod_vett);
x= codistributed(B*x0 + C,cod_vett);
Niter = 1;
TOLX = TOL;
while(norm(x-x0,Inf) > norm(x0,Inf)*TOLX && Niter < MAXITER)
if(TOL*norm(x,Inf) > realmin)
TOLX = norm(x,Inf)*TOL;
else
TOLX = realmin;
end
x0 = x;
x = B*x0 + C;
Niter=Niter+1;
end
end
Niter=Niter{1};
x=gather(x);
end
Below there are the tests
%sequential Jacobi
format long;
A = gallery('poisson',20);
tic;
x= jacobi(A,ones(400,1),1e-6,2000000);
toc;
Elapsed time is 0.009054 seconds.
%parallel Jacobi
format long;
A = gallery('poisson',20);
tic;
x= Parallel_Jacobi(A,ones(400,1),1e-6,2000000);
toc;
Elapsed time is 11.484130 seconds.
I timed the parpool
function with 1,2,3 and 4 workers (I have a quad core processor) and the result is the following:
%Test
format long;
A = gallery('poisson',20);
delete(gcp('nocreate'));
tic
%parpool(1/2/3/4) means that i executed 4 tests that differ only for the
%argument in the function: first parpool(1), second parpool(2) and so on.
parpool(1/2/3/4);
toc
tic;
x= Parallel_Jacobi(A,ones(400,1),1e-6,2000000);
toc;
4 workers: parpool=13.322899 seconds, function=23.772271
3 workers: parpool=10.911769 seconds, function=16.402633
2 workers: parpool=9.371729 seconds, function=12.945154
1 worker: parpool=8.460357 seconds, function=7.982958 .
The less workers, the better the time is. Which is, like @Adriaan said, likely due overhead.
Does this mean that, in this case, the sequential function is always faster than the parallel function? Or is there a better way to implement the parallel one?
In this question it is said that the performance in parallel is better when the number of iterations is high. In my case, with this test, there are only 32 iteration.
The sequential implementation of Jacobi's method is this:
function [x,niter,resrel] = jacobi(A,b,TOL,MAXITER)
n = size(A,1);
D = 1./spdiags(A,0);
B = speye(n)-A./spdiags(A,0);
C= D.*b;
x0=sparse(zeros(length(A),1));
x = B*x0 + C;
Niter = 1;
TOLX = TOL;
while(norm(x-x0,Inf) > norm(x0,Inf)*TOLX && Niter < MAXITER)
if(TOL*norm(x,Inf) > realmin)
TOLX = norm(x,Inf)*TOL;
else
TOLX = realmin;
end
x0 = x;
x = B*x0 + C;
Niter=Niter+1;
end
end
I timed the code with the timeit function and the results are these(the inputs are the same of the previous):
4 workers: 11.693473075964102
3 workers: 9.221281335264003
2 workers: 9.150417240778545
1 worker: 6.047181992020434
sequential: 0.002893932969688