
I want to solve the random walk problem, so I wrote a sequential Fortran code, and now I need to parallelize it.

subroutine random_walk(walkers)

implicit none
include "omp_lib.h"
integer :: i, j, col, row, walkers, m, n, iter
real, dimension(:, :), allocatable :: matrix, res
real :: point, z


col = 12
row = 12


allocate (matrix(row, col), res(row, col))

! Read from file
open(2, file='matrix.txt')
    do i = 1, row
        read(2, *) (matrix(i, j), j = 1, col)
    end do
close(2)

res = matrix


! Solve task

!$omp parallel private(i,j,m,n,point,iter) 

!$omp do collapse(2) 

do i= 2, 11        
    do j=2, 11  

        m = i
        n = j
        iter = 1
        point = 0

        do while (iter <= walkers)
            call random_number(z)
            if (z <= 0.25) m = m - 1
            if (z > 0.25 .and. z <= 0.5) n = n +1
            if (z > 0.5 .and. z <= 0.75) m = m +1
            if (z > 0.75) n = n - 1

            if (m == 1 .or. m == 12 .or. n == 1 .or. n == 12) then 
                point = point + matrix(m, n)
                m = i
                n = j
                iter = iter + 1
            end if

        end do
        point = point / walkers           

        res(i, j) = point    
    end do        
end do

!$omp end do
!$omp end parallel    

! Write to file
open(2, file='out_omp.txt')
    do i = 1, row
        write(2, *) (res(i, j), j = 1, col)
    end do
close(2)
end subroutine random_walk

So, the problem is that the parallel program computes results that are MUCH smaller than those of the sequential version. Where is the mistake? (Apart from my terrible code.)

Update: the code now uses the `!$omp do` directive, but the result is still the same: it is much smaller than the sequential version's.

Dmitry
    It is hard for others to help you without a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). Also, it would help if you could state with more detail your goals and design choices. – Pierre de Buyl Mar 24 '18 at 20:52
  • one option is to get rid of `!$omp sections` – Gilles Gouaillardet Mar 25 '18 at 01:47
  • a better option is to use several `!$omp section` and get rid of the (ugly imho) `if (thrd_num ...)` – Gilles Gouaillardet Mar 25 '18 at 01:49
  • I'm trying to make each thread calculate a separate matrix fragment. Thread 0 is for (2:6;2:6), thread 1 for (7:11;2:6), thread 2 for (2:6;7:11), thread 3 for (7:11;7:11). That is why I need this ugly `if (thrd_num ...)`. – Dmitry Mar 25 '18 at 09:03
  • unless you want to control the mapping (e.g. which thread does what) which I doubt, you do not need this. anyway, just use OpenMP sections correctly, see https://stackoverflow.com/questions/2770911/how-does-the-sections-directive-in-openmp-distribute-work for an example – Gilles Gouaillardet Mar 25 '18 at 09:48
  • Thanks, now the sections work correctly, but the problem is still there: the sequential version takes 0.866 s and the parallel one 3.430 s. It seems the sections run sequentially no matter what. – Dmitry Mar 25 '18 at 10:10
  • Please upload a [MCVE] with the working sections – Gilles Gouaillardet Mar 25 '18 at 11:37
  • Uploaded the working sections. – Dmitry Mar 25 '18 at 12:05
  • Not really something that can be compiled and benchmarked ... anyway, why don't you use a standard `omp do` loop ? You want to swap the `i` and `j` loops, and `collapse` might help too. Note `random_number()` might implicitly serialize the threads, and hence kill performance, a simple benchmark can clear that up. – Gilles Gouaillardet Mar 25 '18 at 13:29
  • Is there any reason you can't stick this all in a subroutine (except the sections), make arrays of the arguments and loop over calling the subroutine with different elements of the array? If I've understood you correctly this could then be parallelised with a simple !$ omp do, and you wouldn't be stuck with a fixed number of sections. – Ian Bush Mar 25 '18 at 16:22
  • Updated the code, but the result is the same.. – Dmitry Mar 25 '18 at 17:14

1 Answer

Most probably, the behavior is related to the random number generation. The Fortran RANDOM_NUMBER intrinsic is not even guaranteed to be thread-safe by the standard; it is thread-safe at least in the GNU compiler, thanks to a GNU extension. But in any case the performance seems to be very bad, as you note.

If you switch to a different, thread-safe random number generator, the scalability of your code can be good. I used the classical ran2.f generator:

http://www-star.st-and.ac.uk/~kw25/research/montecarlo/ran2.f

modified to make it thread-safe. If I am not mistaken, the changes are:

  • in the calling unit declare and define:

    integer :: iv(32), iy, idum2, idum

    idum2 = 123456789 ; iv(:) = 0 ; iy = 0

  • in the OpenMP directives, add idum as private and idum2, iv, iy as firstprivate (by the way, you need to add z as private too)

  • in the parallel section, add idum = -omp_get_thread_num() (before the do loop) so that different threads draw different random sequences

  • from the ran2 function, remove the DATA and SAVE lines and pass idum2, iv, iy as arguments:

    FUNCTION ran2(idum, iv, iy, idum2)

  • call ran2 instead of the random_number intrinsic: z = ran2(idum, iv, iy, idum2)
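
Taken together, the steps above amount to something like the following sketch (variable names follow the list above; the body of the walk is elided, and I have not compiled this exact fragment against the original ran2.f):

```fortran
integer :: iv(32), iy, idum2, idum
idum2 = 123456789 ; iv(:) = 0 ; iy = 0

!$omp parallel private(i, j, m, n, point, iter, z, idum) &
!$omp firstprivate(idum2, iv, iy)
idum = -omp_get_thread_num()   ! distinct seed per thread
!$omp do collapse(2)
do i = 2, 11
    do j = 2, 11
        ! ... same walk as before, but with
        z = ran2(idum, iv, iy, idum2)
        ! ... instead of call random_number(z)
    end do
end do
!$omp end do
!$omp end parallel
```

Because idum, idum2, iv and iy are all private copies of the generator state, the threads never touch shared data inside ran2, which is what makes it thread-safe.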

With walkers=100000 (GNU compiler) these are my times:

1 thread   => 4.7s
2 threads  => 2.4s
4 threads  => 1.5s
8 threads  => 0.78s
16 threads => 0.49s

Not strictly related to the question, but I have to say that extracting a full real number for the two bits of information you need (one of four directions) and then running it through a chain of conditionals can probably be replaced by a more efficient strategy.
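
For example (my sketch, not tested against your code), the four comparisons can be collapsed into a single computed direction index:

```fortran
! Sketch: map one uniform draw in [0,1) to a direction index 0..3
integer :: dir
call random_number(z)        ! or z = ran2(...) in the threaded version
dir = int(4.0 * z)           ! yields 0, 1, 2 or 3, since z < 1
select case (dir)
case (0); m = m - 1
case (1); n = n + 1
case (2); m = m + 1
case (3); n = n - 1
end select
```

One could go further and draw a single integer and consume it two bits at a time, amortizing one generator call over several steps.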

Franz