I'm trying to parallelize code I've written, but I'm having problems performing reductions on arrays. Everything seems to work fine for smallish arrays; however, once the array size goes above a certain point I get either a stack overflow error or a crash.

I've tried increasing the stack size using the /F option at compile time (I'm using ifort on Windows), and I've also tried passing `set KMP_STACKSIZE=xxx`, the Intel-specific stack-size declaration. This sometimes helps and allows the code to progress further through my loop, but in the end it doesn't resolve the issue, even with a stack size of 1 GB or greater.
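For reference, these are the kinds of settings I've been trying (the values shown are just examples of what I experimented with, not recommendations):

```bat
:: link-time stack reserve for the main thread (ifort on Windows)
ifort /F1000000000 /Qopenmp testreduction.f90

:: Intel-specific per-thread stack size for the OpenMP worker threads
set KMP_STACKSIZE=1g

:: portable equivalent defined by the OpenMP specification
set OMP_STACKSIZE=1G
```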

Below is a small self-contained working example of my code. It works in serial, or with one thread, or with many threads and a small N. A large N (e.g. 250,000, as in the example) causes problems.

I didn't think these arrays were so massive as to cause major problems, and presumed increasing my stack space would help. Are there any other options, or have I missed something important in my coding?

program testreduction
    use omp_lib
    implicit none
    integer :: i, j, nthreads, Nsize
    integer iseed /3/
    REAL, allocatable :: A(:,:), B(:), C(:), posi(:,:)
    REAL :: dx, dy, r, Axi, Ayi, m, F
    !Set size of matrix, and main loop
    Nsize = 250000
    m = 1.0
    F = 1.0
    !Allocate posi array
    allocate(posi(2,Nsize))
    !Fill with random numbers
    do i=1,Nsize
        do j=1,2
            posi(j,i) = (ran(iseed))
        end do
    end do
    !Allocate other arrays
    allocate(A(2,Nsize), C(Nsize), B(Nsize))
    !Initialise the accumulators - they must start from defined values
    A = 0.0
    B = 0.0
    C = 0.0

    print*, sizeof(A)+sizeof(B)+sizeof(C)
    !$OMP parallel
    nthreads = omp_get_num_threads()
    !$OMP end parallel

    print*, "Number of threads ", nthreads
    !Go through each array and do some work, calculating a reduction on A, B and C.
    !$OMP parallel do schedule(static) private(i, j, dx, dy, r, Axi, Ayi) reduction(+:C, B, A)
    do i=1,Nsize
        do j=1,Nsize
            if (i == j) cycle !skip self-interaction, r would be zero below
            !print*, i
            dx = posi(1,i) - posi(1,j)
            dy = posi(2,i) - posi(2,j)
            r = sqrt(dx**2+dy**2)
            Axi = -m*(F)*(dx/(r))
            Ayi = -m*(F)*(dy/(r))
            A(1,i) = A(1,i) + Axi
            A(2,i) = A(2,i) + Ayi
            B(i) = B(i) + (Axi+Ayi)
            C(i) = C(i) + dx/(r) + dy/(r)
        end do    
    end do
    !$OMP END parallel do

end program

UPDATE

A better example of what I'm talking about:

program testreduction2
    use omp_lib
    implicit none
    integer :: i, j, nthreads, Nsize, q, k, nsize2
    REAL, allocatable :: A(:,:), B(:), C(:)
    integer, ALLOCATABLE :: PAIRI(:), PAIRJ(:)

    Nsize = 25
    Nsize2 = 19
    q=0

    allocate(A(2,Nsize), C(Nsize), B(Nsize))
    ALLOCATE(PAIRI(nsize*nsize2), PAIRJ(nsize*nsize2))

    do i=1,nsize
        do j =1,nsize2
            q=q+1
            PAIRI(q) = i
            PAIRJ(q) = j
        end do
    end do

    A = 0
    B = 0
    C = 0

    !$OMP parallel do schedule(static) private(i, j, k)
    do k=1,q
        i=PAIRI(k)
        j=PAIRJ(k)
        A(1,i) = A(1,i) + 1
        A(2,i) = A(2,i) + 1
        B(i) = B(i) + 1
        C(i) = C(i) + 1
    END DO
    !$OMP END parallel do

    PRINT*, A
    PRINT*, B
    PRINT*, C       
END PROGRAM

1 Answer


The problem is that you are reducing really large arrays: every thread gets its own private copy of each array listed in the reduction clause. Note that other languages (C, C++) could not reduce arrays at all until OpenMP 4.5.

But I don't see any reason for the reduction in your case: each element is updated by only one value of `i`, and hence by only one thread.

Try just

!$OMP parallel do schedule(static) private(i, dx, dy, r, Axi, Ayi)
do i=1,Nsize
  do j=1,Nsize
    ...
    A(1,i) = A(1,i) + Axi
    A(2,i) = A(2,i) + Ayi
    B(i) = B(i) + (Axi+Ayi)
    C(i) = C(i) + dx/(r) + dy/(r)
  end do
end do
!$OMP END parallel do

The point is that the threads do not interfere. Every thread uses a different set of `i` values and therefore touches different elements of `A`, `B` and `C`.

Even if you come up with a case where it seems to be necessary, you can always rewrite it to avoid the reduction. You can allocate some buffers yourself and simulate the reduction, or use atomic updates.
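For instance, a hand-rolled version for the loop in your update, with a heap-allocated private buffer per thread that is merged at the end. This is only a sketch: `Bloc` is a name I made up, and I show just the `B` array to keep it short; `A` and `C` would get buffers of their own in the same way.

```fortran
REAL, allocatable :: Bloc(:)   ! per-thread partial sums, private below

!$OMP parallel private(k, i, Bloc)
allocate(Bloc(Nsize))          ! allocated on the heap, not the stack
Bloc = 0.0
!$OMP do schedule(static)
do k = 1, q
    i = PAIRI(k)
    Bloc(i) = Bloc(i) + 1
end do
!$OMP end do
!$OMP critical                 ! merge the buffers one thread at a time
B = B + Bloc
!$OMP end critical
deallocate(Bloc)
!$OMP end parallel
```

Alternatively, keep your loop as it is and put `!$OMP atomic` in front of each update of `A`, `B` and `C`; that is simpler, but can be slower when many threads hit the same elements.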

  • In this example you are correct, I don't need the reduction, however in the real code I do. This was just the simplest example I could come up with that failed in the same way. – MechanicalEagle36 Jul 22 '14 at 09:33
  • There is still no need for the reduction in your update. If you have something different, show it. – Vladimir F Героям слава Jul 22 '14 at 11:15
  • I've added another code example, which better demonstrates what I'm actually doing. In this case, I've **not** done a reduction just to see what happens. The answer for each element in each array should be 19. Curiously this actually seems to work in the above example. I'm not sure why though. In this example what is stopping different threads having the same value of 'i', and why doesn't this cause problems ? – MechanicalEagle36 Jul 22 '14 at 12:45
  • Still too simple, `PAIRI` depends on `i` only and you do not use `j`. No need for reduction. Show the **actual** code. – Vladimir F Героям слава Jul 22 '14 at 12:49
  • Maybe I'm not getting this, but if two threads both have the same value of `i` and try to write to the same element, isn't that a problem? – MechanicalEagle36 Jul 22 '14 at 13:26
  • They can't have the same value of `i`; that's the main principle of OpenMP loop work-sharing. Every thread has a distinct set of indexes to work with, and the loop body is executed only once for each index. – Vladimir F Героям слава Jul 22 '14 at 13:26
  • In my last example the loop index is `k`, not `i`, so as far as I can tell there is nothing stopping two threads with different `k` from having identical `i`. – MechanicalEagle36 Jul 22 '14 at 13:32
  • I know; note that the part initializing `PAIRI` is indented in a very bad way (as is all your code – never mix spaces and tabs) and is not very readable, so I missed one of the lines there. But what I wrote holds: I would always avoid reducing such large arrays and would use my own private buffers or `atomic` updates instead. – Vladimir F Героям слава Jul 22 '14 at 13:47
  • I know, sorry about that, it didn't seem to paste on very well and got a bit garbled. I've implemented a private buffer type solution that I think will work, thanks for all your help, much appreciated. – MechanicalEagle36 Jul 22 '14 at 13:57
  • Strange, all big arrays are allocatable. Listing them in a reduction clause should not lead to exhaustion of the stack space, as the private copies are supposed to be allocatable too. – Hristo Iliev Jul 22 '14 at 14:45
  • I am not sure it is that certain what the private copies will be. – Vladimir F Героям слава Jul 22 '14 at 14:47