I wrote the following code and then tried to parallelise it with OpenMP. After compiling the OpenMP version with f2py, the results are wrong when I run it from Python: there are no error messages, but the numbers come out slightly off, and running the compiled module sometimes kills the kernel.

I am wondering if this has anything to do with my parallel regions. I am always a bit confused about which variables to make private, so can anyone spot any errors?

subroutine simulate_omp(m,nt,s0,l,d,a,numthreads,x,y,psi)
!Pedestrian motion model
!input variables:
!n = m^2 = number of students
!nt: number of time steps
!s0: student speed
!l: initial spacing between students
!d: student motion influence by all other students within distance <= d
!a: noise amplitude
!numthreads: number of threads to use in parallel regions
!output variables:
!x,y: all m^2 student paths from i=1 to nt+1
!psi: synchronization parameter, stored at all nt+1 times (including the initial condition)
use omp_lib
implicit none
integer, intent(in) :: m,nt,numthreads
real(kind=8), intent(in) :: s0,l,d,a
real(kind=8), dimension(m*m,nt+1), intent(out) :: x,y
real(kind=8), dimension(nt+1), intent(out) :: psi
real(kind=8), dimension(m*m,nt+1) :: xtemp,ytemp,u,v
real(kind=8), dimension(m*m,nt) :: usum,vsum,umean,vmean
real(kind=8) :: r(m*m)
real(kind=8),parameter :: pi = 4*atan(1.0_8)
integer :: i1,j1,k1,i2,j2,k2,count

!$ call omp_set_num_threads(numthreads)
! initialize student positions
x = 0.d0
y = 0.d0
k1 = 0
do i1 = 1,m
  do j1=1,m
    k1 = k1 + 1
    x(k1,1) = (j1-1)*l/2 - (m-1)*l/4
    y(k1,1) = (i1-1)*l/2 - (m-1)*l/4
  end do
end do
x(:,1) = x(:,1)/(m-1)
y(:,1) = y(:,1)/(m-1)

! initialize
xtemp(:,1) = x(:,1)
ytemp(:,1) = y(:,1)
call random_number(r)
u(:,1) = s0*cos(r*2*pi-pi)
v(:,1) = s0*sin(r*2*pi-pi)
psi(1) = sqrt(sum(u(:,1))**2+sum(v(:,1))**2)/dble(m)/dble(m)/s0

do i2 = 1,nt
  !$OMP parallel do private(j2,k2,l)
  do j2 = 1,m*m
    usum(j2,i2) = 0
    vsum(j2,i2) = 0
    count = 0
    !$OMP parallel do reduction(+:usum,vsum,count)
    do k2 = 1,m*m
      if ((xtemp(k2,i2)-xtemp(j2,i2))**2+(ytemp(k2,i2)-ytemp(j2,i2))**2<=d**2) then
        usum(j2,i2) = usum(j2,i2)+u(k2,i2)
        vsum(j2,i2) = vsum(j2,i2)+v(k2,i2)
        count = count+1
      end if
    end do
    !$OMP end parallel do
    umean(j2,i2) = usum(j2,i2)/dble(count)
    vmean(j2,i2) = vsum(j2,i2)/dble(count)
    u(j2,i2+1) = s0*cos(atan(vmean(j2,i2)/umean(j2,i2))+a*(r(j2)*2*pi-pi))
    v(j2,i2+1) = s0*sin(atan(vmean(j2,i2)/umean(j2,i2))+a*(r(j2)*2*pi-pi))
    xtemp(j2,i2+1) = xtemp(j2,i2)+u(j2,i2+1)
    ytemp(j2,i2+1) = ytemp(j2,i2)+v(j2,i2+1)

    ! boundary conditions
    if (xtemp(j2,i2+1)>l) then
      xtemp(j2,i2+1) = xtemp(j2,i2+1)-2*l
    else
      if (xtemp(j2,i2+1)<-l) then
        xtemp(j2,i2+1) = xtemp(j2,i2+1)+2*l
      end if
    end if
    if (ytemp(j2,i2+1)>l) then
      ytemp(j2,i2+1) = ytemp(j2,i2+1)-2*l
    else
      if (ytemp(j2,i2+1)<-l) then
        ytemp(j2,i2+1) = ytemp(j2,i2+1)+2*l
      end if
    end if
  end do
  !$OMP end parallel do
  psi(i2+1) = sqrt(sum(u(:,i2+1))**2+sum(v(:,i2+1))**2)/dble(m)/dble(m)/s0
end do
x(:,1:nt+1) = xtemp(:,1:nt+1)
y(:,1:nt+1) = ytemp(:,1:nt+1)

end subroutine simulate_omp
Febday
    Apparently, you would have a race on count in your outer parallel loop. Parallelizing the inner loops with array reduction looks inefficient. – tim18 Dec 13 '17 at 10:22
  • Sorry, could you please elaborate on this? I am still new to OpenMP, so I am unsure about parallelising nested loops. – Febday Dec 13 '17 at 11:58
  • Please read [ask]. Do not use formulations like *"certain errors"*, show the errors. Show all error messages, tell us which command generated them. If you get wrong results, show them and explain why they are wrong and how should correct results look like. It really is important. – Vladimir F Героям слава Dec 14 '17 at 12:24
  • BTW, note that using `kind=8` is very ugly and not portable. Your code will not compile with several compilers. The number 8 does NOT universally mean 8 bytes. See https://stackoverflow.com/questions/838310/fortran-90-kind-parameter for more. – Vladimir F Героям слава Dec 14 '17 at 12:26
  • There are no error messages, only that the numbers are a bit off, and whenever I compile it with f2py and run it in Python, it kills the kernel. – Febday Dec 14 '17 at 21:07
  • Always [edit] the question with important information, don't put them into comments. Are there no messages when the kernel is killed? – Vladimir F Героям слава Dec 16 '17 at 08:21

1 Answer


The argument l is declared with intent(in) and is not modified in the loop, so there is no need to declare it private. In fact, a private copy would be undefined on entry to the parallel region, so with private(l) the boundary-condition tests would read an undefined value. Below is a suggestion without the outer parallel loop:

subroutine simulate_omp(m,nt,s0,l,d,a,numthreads,x,y,psi)
!Pedestrian motion model
!input variables:
!n = m^2 = number of students
!nt: number of time steps
!s0: student speed
!l: initial spacing between students
!d: student motion influence by all other students within distance <= d
!a: noise amplitude
!numthreads: number of threads to use in parallel regions
!output variables:
!x,y: all m^2 student paths from i=1 to nt+1
!psi: synchronization parameter, stored at all nt+1 times (including the initial condition)
use omp_lib
implicit none
integer, intent(in) :: m,nt,numthreads
real(kind=8), intent(in) :: s0,l,d,a
real(kind=8), dimension(m*m,nt+1), intent(out) :: x,y
real(kind=8), dimension(nt+1), intent(out) :: psi
real(kind=8), dimension(m*m,nt+1) :: xtemp,ytemp,u,v
real(kind=8) :: usum,vsum,umean,vmean
real(kind=8) :: r(m*m)
real(kind=8),parameter :: pi = 4*atan(1.0_8)
integer :: i1,j1,k1,i2,j2,k2,count

!$ call omp_set_num_threads(numthreads)
! initialize student positions
x = 0.d0
y = 0.d0
k1 = 0
do i1 = 1,m
  do j1=1,m
    k1 = k1 + 1
    x(k1,1) = (j1-1)*l/2 - (m-1)*l/4
    y(k1,1) = (i1-1)*l/2 - (m-1)*l/4
  end do
end do
x(:,1) = x(:,1)/(m-1)
y(:,1) = y(:,1)/(m-1)

! initialize
xtemp(:,1) = x(:,1)
ytemp(:,1) = y(:,1)
call random_number(r)
u(:,1) = s0*cos(r*2*pi-pi)
v(:,1) = s0*sin(r*2*pi-pi)
psi(1) = sqrt(sum(u(:,1))**2+sum(v(:,1))**2)/dble(m)/dble(m)/s0

do i2 = 1,nt
  do j2 = 1,m*m
    usum = 0
    vsum = 0
    count = 0
    !$OMP parallel do private(k2), reduction(+:usum,vsum,count)
    do k2 = 1,m*m
      if ((xtemp(k2,i2)-xtemp(j2,i2))**2+(ytemp(k2,i2)-ytemp(j2,i2))**2<=d**2) then
        usum = usum+u(k2,i2)
        vsum = vsum+v(k2,i2)
        count = count+1
      end if
    end do
    !$OMP end parallel do
    umean = usum/dble(count)
    vmean = vsum/dble(count)
    u(j2,i2+1) = s0*cos(atan(vmean/umean)+a*(r(j2)*2*pi-pi))
    v(j2,i2+1) = s0*sin(atan(vmean/umean)+a*(r(j2)*2*pi-pi))
    xtemp(j2,i2+1) = xtemp(j2,i2)+u(j2,i2+1)
    ytemp(j2,i2+1) = ytemp(j2,i2)+v(j2,i2+1)

    ! boundary conditions
    if (xtemp(j2,i2+1)>l) then
      xtemp(j2,i2+1) = xtemp(j2,i2+1)-2*l
    else
      if (xtemp(j2,i2+1)<-l) then
        xtemp(j2,i2+1) = xtemp(j2,i2+1)+2*l
      end if
    end if
    if (ytemp(j2,i2+1)>l) then
      ytemp(j2,i2+1) = ytemp(j2,i2+1)-2*l
    else
      if (ytemp(j2,i2+1)<-l) then
        ytemp(j2,i2+1) = ytemp(j2,i2+1)+2*l
      end if
    end if
  end do
  psi(i2+1) = sqrt(sum(u(:,i2+1))**2+sum(v(:,i2+1))**2)/dble(m)/dble(m)/s0
end do
x(:,1:nt+1) = xtemp(:,1:nt+1)
y(:,1:nt+1) = ytemp(:,1:nt+1)

end subroutine simulate_omp

You can time it and compare it with the outer loop parallelised using `private(j2,k2,umean,vmean,usum,vsum,count), shared(u,v,xtemp,ytemp)`. Make sure to have OMP_NESTED set to true for the latter tests, since the inner reduction loop would then be nested inside the outer parallel loop.
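If nested parallelism is more than you need, a simpler variant parallelises only the outer loop and keeps the inner distance loop serial. Below is a minimal, self-contained toy sketch of that pattern (illustrative only; the array size, threshold and thread count are made up, not taken from the model above). The per-iteration scalars are private to each thread, which avoids the race on count mentioned in the comments:

program outer_private_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 8
  real(kind=8) :: x(n), meanval(n), s
  integer :: j, k, count
  call random_number(x)
  !$ call omp_set_num_threads(4)  ! note the space after !$ so this line is compiled under OpenMP
  ! j is private automatically as the loop index; s and count must be
  ! declared private because every iteration needs its own copy
  !$OMP parallel do private(k,s,count) shared(x,meanval)
  do j = 1, n
    s = 0.0_8
    count = 0
    do k = 1, n  ! serial inner loop inside each thread's iteration
      if (abs(x(k)-x(j)) <= 0.25_8) then
        s = s + x(k)
        count = count + 1
      end if
    end do
    meanval(j) = s/dble(count)  ! count >= 1: point j is always within range of itself
  end do
  !$OMP end parallel do
  print *, meanval
end program outer_private_demo

Compiled with OpenMP enabled (for example gfortran -fopenmp), each thread handles a block of j iterations with its own s and count; without OpenMP the directives are plain comments and the program still runs serially.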

user1824346
  • Thanks a lot, but does it mean that whenever there is a nested loop, we only need to parallelise the innermost loop? – Febday Dec 14 '17 at 21:09
  • @VladimirF there is probably no specification: it just looked strange to me. As you said, it should not lead to an error. – user1824346 Dec 16 '17 at 05:48
  • @Febday as a previous comment pointed out a race condition, I have skipped the outer parallel loop. You can add it with private(j2,k2,umean,vmean,usum,vsum,count) and time the execution. Make sure to have OMP_NESTED set to true. – user1824346 Dec 16 '17 at 05:51
  • @Febday You may also add shared(u,v,xtemp,ytemp) to this outer parallel loop. I have updated the answer accordingly. – user1824346 Dec 16 '17 at 05:59