
I've got a part of a Fortran program consisting of some nested loops which I want to parallelize with OpenMP.

integer :: nstates , N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4):: mat
double complex, dimension(:), allocatable :: vecin,vecout 

nstates = 2
N = 24

allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0

mat = reshape([...some data...],[4,4])

dimb=nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) 
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    !$OMP ATOMIC
                    vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END PARALLEL DO 

The program runs and the result is correct; it's just incredibly slow, much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of PRIVATE or OMP ATOMIC? I would be grateful for any advice on how to improve the performance of my code.

Yannick

2 Answers


If your arrays are too large and you get stack overflows with automatic reduction, you can implement the reduction yourself with allocatable temporary arrays.

As Francois Jacq pointed out, you also have a race condition caused by dima and dimc, which must be private.

double complex, dimension(:), allocatable :: tmp

!$OMP PARALLEL PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col,tmp)
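! tmp is PRIVATE, so each thread allocates and zeroes its own copy of the result below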

allocate(tmp(size(vecout)))
tmp = 0

!$OMP DO
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    tmp(row) = tmp(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END DO

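! combine the per-thread partial results; CRITICAL serializes the updates to vecout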
!$OMP CRITICAL
vecout = vecout + tmp
!$OMP END CRITICAL
!$OMP END PARALLEL
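
For comparison, the automatic reduction mentioned at the top would look roughly like the sketch below. REDUCTION(+:vecout) gives every thread its own private copy of the whole vecout array, which is exactly what can overflow the thread stack when nstates**N is large; that is the reason for the manual tmp version above. This is only a sketch of the alternative, not a tested drop-in replacement:

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) REDUCTION(+:vecout)
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)
    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    ! no ATOMIC needed: each thread accumulates into its own copy of vecout
                    vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END PARALLEL DO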
1

Could you try something like this:

do b_col=1,dimb
   do i=1,N-1
      dima=nstates**(i-1)
      dimc=nstates**(N-i-1)
      !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(row,col,a_row,b_row,c_row)
      do a_row = 1, dima
         do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            enddo
         enddo
      enddo
   enddo
enddo

The advantage is that the parallel loop does not cause collisions now: all the row indexes are different.
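
Whether COLLAPSE pays off depends on the loop bounds and on the machine, so the simplest way to decide between the variants is to time them. A minimal timing sketch using omp_get_wtime from omp_lib (t0 and t1 are local variables introduced just for this illustration):

use omp_lib
double precision :: t0, t1

t0 = omp_get_wtime()
! ... run one variant of the loop nest here ...
t1 = omp_get_wtime()
print '(a,f8.3,a)', 'elapsed: ', t1-t0, ' s'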

Francois Jacq
  • Thanks for the answer. I will measure the time of your and Vladimir F's solutions to compare them. What I can say so far is that your solution runs faster for me without COLLAPSE. Can you tell me why you put the b_col loop outside the others? The row index should still be different for every thread when the b_col loop is inside, shouldn't it? – Yannick Apr 01 '15 at 19:24
  • The b_col loop is outside the parallel loop to prevent threads from computing the same row index, which would be a race situation when modifying vecout (risk of random errors). Why did you delete the COLLAPSE clause? It just indicates that the three following loops may be merged into a single parallel one. – Francois Jacq Apr 02 '15 at 07:06
  • row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row, so row does not depend on b_col. To your COLLAPSE question: first I tried with COLLAPSE and then without. Without it was faster, so why should I use it? – Yannick Apr 02 '15 at 07:43
  • Because row does not depend on b_col, the same row index may be obtained b_col times. About COLLAPSE: if you have measured with and without, then all is OK; the judge is always the measurement! – Francois Jacq Apr 02 '15 at 16:55
  • In my previous message, read "dimb times" instead of "b_col times" – Francois Jacq Apr 02 '15 at 17:46