2

I'm having some trouble executing a program with an OpenMP parallel do. Here is a test code:

module test
use, intrinsic :: iso_fortran_env, only: dp => real64
implicit none
contains

subroutine Addition(x,y,s)
    real(dp),intent(in) :: x,y
    real(dp), intent(out) :: s
    s = x+y
end subroutine Addition

function linspace(length,xi,xf) result (vec)
! function to create an equally spaced vector given a begin and end point
    real(dp),intent(in) :: xi,xf
    integer, intent(in) :: length
    real(dp),dimension(1:length) :: vec
    integer :: i
    real(dp) :: increment

    increment = (xf - xi) / real(length - 1, dp)
    vec(1) = xi
    do i = 2,length
        vec(i) = vec(i-1) + increment
    end do
end function linspace
end module test

program paralleltest
use, intrinsic :: iso_fortran_env, only: dp => real64
use test
use :: omp_lib
implicit none
integer, parameter :: length = 1000
real(dp),dimension(length) :: x,y
real(dp) :: s
integer:: i,j
integer :: num_threads = 8
real(dp),dimension(length,length) :: SMatrix

x = linspace(length, 0.0d0, 1.0d0)
y = linspace(length, 2.0d0, 3.0d0)

!$ call omp_set_num_threads(num_threads)
!$OMP PARALLEL DO
do i = 1, size(x)
    do j = 1, size(y)
        call Addition(x(i), y(j), s)
        SMatrix(i,j) = s
    end do
end do
!$OMP END PARALLEL DO

open(unit=1,file ='Add6.dat')
do i= 1,size(x)
    do j= 1,size(y)
        write(1,*) x(i),";",y(j),";",SMatrix(i,j)
    end do
end do
close(unit=1)
end program paralleltest

I'm compiling and running the program in the following way: gfortran-8 -fopenmp paralleltest.f03 -o pt.out -mcmodel=medium and then export OMP_NUM_THREADS=8. This simple code raises at least two big questions about parallel do for me.

The first is that if I run it with length = 1100 or greater, I get a Segmentation fault (core dumped) error message, but with smaller values it runs with no problem.

The second is about the time it takes. When I run it with length = 1000 (run with time ./pt.out), it takes 1.732 s, but if I run it sequentially (compiled without the -fopenmp flag, and run with taskset -c 4 time ./pt.out) it takes 1.714 s. I guess the difference between the two ways would show up in longer and more complex code, where parallelism is more useful. In fact, when I tried it with more complex calculations running in parallel on eight threads, the time was cut to half of the sequential time, but not to an eighth as I expected.

In view of this, my questions are: is this kind of speedup always available, or is it code dependent? And second, is there a friendly way to control which thread runs which iterations? That is, the first thread running the first length/8 iterations, and so on, like performing several taskset runs with different code where each contains the iterations that I want.

Daniel
  • For the segfault, it is a well-known problem; we have many duplicates: https://stackoverflow.com/questions/13264274/why-segmentation-fault-is-happening-in-this-openmp-code Use an allocatable array. You will find many similar errors by simply searching here. I suggest removing that bit from your question. – Vladimir F Героям слава Mar 11 '19 at 16:05
  • Seems to me that you need "private(s)". Not that that will affect a SEGV, but it will affect the result! – Jim Cownie Mar 12 '19 at 05:40
  • Side comment: Fortran is [column-major](https://en.wikipedia.org/wiki/Row-_and_column-major_order), while your inner loop works on rows of SMatrix. Not cache-friendly. – Jean-Claude Arbaut Mar 12 '19 at 05:44
  • @JimCownie the private attribute should be given in the module or in the main part of the program? – Daniel Mar 13 '19 at 13:51
  • @Daniel On the omp parallel. You want each thread to have its own copy of s. The way your code is now there is only a single, shared copy, so when multiple threads execute the loops they are all assigning to the same s, overwriting each other's results, which may happen before they have stored the result. Another approach would be just to eliminate s completely and pass Smatrix(i,j) as the final argument to Addition... – Jim Cownie Mar 14 '19 at 10:01

1 Answer

3

As I commented, the segmentation fault has been treated elsewhere (Why Segmentation fault is happening in this openmp code?). I would use an allocatable array, but you can also raise the stack size limit with ulimit -s.
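As a minimal sketch of the allocatable variant (only the declaration changes, plus an explicit allocate; SMatrix then lives on the heap, so the stack limit no longer matters):

real(dp), dimension(:,:), allocatable :: SMatrix
integer :: ierr

! heap allocation instead of a large automatic/static array
allocate(SMatrix(length, length), stat=ierr)
if (ierr /= 0) stop 'could not allocate SMatrix'
! ... computation and output as before ...
deallocate(SMatrix)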

Regarding the time: almost all of the runtime is spent writing the array to the external file.

But even if you remove that, measure only the time spent in the parallel section using omp_get_wtime(), and increase the problem size, it still does not scale too well. This is because there is very little computation for the CPU to do and a lot of writing of the array to memory (the cache misses mean slow accesses to main memory).
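A sketch of that measurement (omp_lib is already used in the program, so omp_get_wtime() is available):

real(dp) :: t_start, t_end

t_start = omp_get_wtime()
! ... the parallel do loops, without the file output ...
t_end = omp_get_wtime()
write(*,*) 'parallel section: ', t_end - t_start, ' s'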

As Jean-Claude Arbaut pointed out, your loop order is wrong and makes accessing the memory even slower. Some compilers can exchange the loops for you at higher optimization levels (-O2 or -O3), but only some of them.
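For illustration, the cache-friendly ordering keeps the first index i in the innermost loop (serial loops only as a sketch; the race on s is addressed next):

! j selects a column of SMatrix, i then walks down that column contiguously
do j = 1, size(y)
    do i = 1, size(x)
        call Addition(x(i), y(j), s)
        SMatrix(i,j) = s
    end do
end do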

And even worse, as Jim Cownie pointed out, you have a race condition: multiple threads try to use the same s for both reading and writing, which makes the program invalid. You need to make s private using private(s).
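Putting both fixes together, the parallel section could look like this sketch (the do-variables i and j are predetermined private in Fortran OpenMP, so only s needs the clause):

!$OMP PARALLEL DO PRIVATE(s)
do j = 1, size(y)
    do i = 1, size(x)
        call Addition(x(i), y(j), s)   ! each thread now has its own copy of s
        SMatrix(i,j) = s
    end do
end do
!$OMP END PARALLEL DO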


With the above fixes I get a roughly two-times-faster parallel section with four cores and four threads. Don't try to use hyper-threading; it slows the program down.

If you give the CPU more computational work to do, such as s = Bessel_J0(x) / Bessel_J1(y), it scales pretty well for me: almost four times faster with four threads, and hyper-threading does speed it up a little bit.
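A sketch of what that heavier loop body could look like, using the Fortran 2008 bessel_j0/bessel_j1 intrinsics (J1 has no zeros on [2,3], so the division is safe for this y):

!$OMP PARALLEL DO PRIVATE(s)
do j = 1, size(y)
    do i = 1, size(x)
        ! more arithmetic per element, so the threads become compute-bound
        s = bessel_j0(x(i)) / bessel_j1(y(j))
        SMatrix(i,j) = s
    end do
end do
!$OMP END PARALLEL DO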


Finally, I suggest just removing the manual setting of the number of threads; it is a pain for testing. If you remove that, you can easily use OMP_NUM_THREADS=4 ./a.out.