Does the following work for you? (Updated with an ATOMIC directive to prevent the problem identified by Massimiliano.)
!$OMP PARALLEL DO PRIVATE(k, ind, temp)
do i = 1, lastcol
   do k = ia(i), ia(i+1)-1
      ind = ja(k)
      temp = x(i)*a(k)
!$OMP ATOMIC
      y(ind) = y(ind) + temp
   end do
end do
!$OMP END PARALLEL DO
This divides the work of the outer loop across the available threads, while ensuring that each thread gets its own private copies of the inner-loop variables k, ind, and temp. The ATOMIC directive protects the update of y(ind), since several threads may write to the same element at once. (Note the sentinel must be written !$OMP with no space, and a plain atomic update applies to the single statement that follows, so no END ATOMIC is needed.)
It's been a while since I've used OpenMP, so if this doesn't work for you please let me know in the comments. In the meantime, there is a very nice reference/tutorial here.
Also, you will find that a similar question was asked earlier; although the language was C, the basic loop structure was very similar. The discussion there suggests that once the matrix grows larger than the cache, the speedup from parallelization is minimal.