
I'm doing an n-body simulation and I have an array which stores the accelerations of the particles. Now I'm trying to parallelize my code and I've run into this problem: since the array is of type real, I'm not able to lock it, because the argument to OpenMP's init_lock subroutine must be an integer.

Is there a way around this? I used the critical construct, but it had no effect on the computation time. In the code below I tried locking the array indices, but that is not working either.

    call omp_init_lock(i,j)

    !$omp parallel 
    !$omp do private(rx,ry,rz,rsqd,r2i,r6i,virij,ff,i,j) schedule(dynamic)
    do i=1,m-1
        do j=i+1,m
            rsqd = 0.0
                rx = x(i,1) - x(j,1)
                ry = x(i,2) - x(j,2)
                rz = x(i,3) - x(j,3)                                    
                rsqd = rsqd + rx*rx +ry*ry + rz*rz
                !calculation of another variable ff
                ! $omp critical
                call omp_set_lock(i)
                    a(i,1) = a(i,1) + rx*ff
                    a(i,2) = a(i,2) + ry*ff
                    a(i,3) = a(i,3) + rz*ff
                call omp_unset_lock(i)
                call omp_set_lock(j)                                    
                    a(j,1) = a(j,1) - rx*ff
                    a(j,2) = a(j,2) - ry*ff
                    a(j,3) = a(j,3) - rz*ff 
                call omp_unset_lock(j)  
                ! $omp end critical         
        end do
    end do
user1844

2 Answers


One option is to use one critical section around the a(i,x) and a(j,x) statements and eliminate all the locks (they are being used redundantly).

So I'd suggest you do something like:

!$omp parallel 
!$omp do private(rx,ry,rz,rsqd,r2i,r6i,virij,ff,i,j) schedule(dynamic)
do i=1,m-1
    do j=i+1,m
        rsqd = 0.0
            rx = x(i,1) - x(j,1)
            ry = x(i,2) - x(j,2)
            rz = x(i,3) - x(j,3)                                    
            rsqd = rsqd + rx*rx +ry*ry + rz*rz
            !calculation of another variable ff
            !$omp critical
                a(i,1) = a(i,1) + rx*ff
                a(i,2) = a(i,2) + ry*ff
                a(i,3) = a(i,3) + rz*ff
                a(j,1) = a(j,1) - rx*ff
                a(j,2) = a(j,2) - ry*ff
                a(j,3) = a(j,3) - rz*ff 
            !$omp end critical
    end do
end do
!$omp end parallel
jandres742
  • I think your approach is not valid. You still have race conditions there. For instance, the value in `a(3,1)` can be written by two different threads, one in which `i = 3` and another in which `j = 3`, so it needs protection. – Harald Sep 07 '16 at 09:19
  • Oh yes, you are right. Missed that one. In that case, the critical section should start before the `a(i,x)` statements. Thanks – jandres742 Sep 07 '16 at 13:07
  • Probably you'd like to edit and update your answer then. – Harald Sep 07 '16 at 15:12

The locks you've found are data structures that ensure only one thread at a time executes the code between the set and unset calls. If you'd like to protect the accesses to the matrix, you could go from a single lock for the whole matrix, to one lock per cell, or something in between, such as one lock per row or column.
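
For illustration, a per-particle (per-row) locking scheme might look roughly like the sketch below (a sketch only, reusing `m` and `a` from the question; `row_lock` is a placeholder name). Note that OpenMP lock variables must be declared with kind `omp_lock_kind` from the `omp_lib` module; a plain loop index, as passed to `omp_set_lock` in the question, is not a valid lock:

    use omp_lib
    integer(kind=omp_lock_kind), allocatable :: row_lock(:)
    integer :: k

    ! one lock per particle, i.e. per row of a
    allocate(row_lock(m))
    do k = 1, m
        call omp_init_lock(row_lock(k))
    end do

    ! inside the parallel loop, guard the updates to row i:
    !     call omp_set_lock(row_lock(i))
    !     a(i,1) = a(i,1) + rx*ff
    !     a(i,2) = a(i,2) + ry*ff
    !     a(i,3) = a(i,3) + rz*ff
    !     call omp_unset_lock(row_lock(i))
    ! ...and likewise with row_lock(j) for the updates to row j

    ! after the parallel region, release the locks
    do k = 1, m
        call omp_destroy_lock(row_lock(k))
    end do
    deallocate(row_lock)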

I would suggest you use the omp atomic construct on each of the conflicting assignments you have. The benefit of atomic in contrast to omp critical and locks is that atomic can be implemented by the compiler/runtime with atomic processor instructions, which increases performance because fewer instructions are executed per update (see openMP, atomic vs critical? and http://openmp.org/forum/viewtopic.php?f=3&t=1549). Another benefit of the omp atomic construct is that your code compiles seamlessly when OpenMP is not enabled at compile time, since the directive is simply treated as a comment; that is not the case if you use the lock routines.

This could be a variation of your code:

!$omp parallel 
!$omp do private(rx,ry,rz,rsqd,r2i,r6i,virij,ff,i,j) schedule(dynamic)
do i=1,m-1
    do j=i+1,m
        rsqd = 0.0
            rx = x(i,1) - x(j,1)
            ry = x(i,2) - x(j,2)
            rz = x(i,3) - x(j,3)                                    
            rsqd = rsqd + rx*rx +ry*ry + rz*rz
            !calculation of another variable ff
            !$omp atomic
            a(i,1) = a(i,1) + rx*ff
            !$omp atomic
            a(i,2) = a(i,2) + ry*ff
            !$omp atomic
            a(i,3) = a(i,3) + rz*ff
            !$omp atomic
            a(j,1) = a(j,1) - rx*ff
            !$omp atomic
            a(j,2) = a(j,2) - ry*ff
            !$omp atomic
            a(j,3) = a(j,3) - rz*ff 
    end do
end do
!$omp end parallel

EDIT: Since you're interested in obtaining further performance from this, I'd suggest you use a performance analysis tool (such as VTune [1], Paraver [2], TAU [3], Scalasca [4], or even gprof). That being said, the small impact you mention when parallelizing this code may indicate that this loop represents a small fraction of the total execution time. Performance analysis tools would help you identify the most time-consuming routines.

If this portion of code is representative, the cost of creating the parallel region and waiting at its implicit barrier, and/or the serialization due to the critical/atomic regions, might limit the speedup you obtain. For the first issue, you might consider providing a minimum chunk size to the dynamic schedule (e.g. dynamic, 4). That would help in the last iterations, in which i is close to m and thus the inner do j loop does not represent a large amount of work. You can also consider the guided schedule. Its behavior is somewhat similar to dynamic, but the chunk size decreases as the loop progresses, starting with large chunks and shrinking toward the specified minimum.
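
Concretely, only the scheduling clause changes; the chunk size of 4 is just an example value to tune:

    !$omp do private(rx,ry,rz,rsqd,r2i,r6i,virij,ff,i,j) schedule(dynamic, 4)

or, with the guided policy:

    !$omp do private(rx,ry,rz,rsqd,r2i,r6i,virij,ff,i,j) schedule(guided, 4)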

[1] https://software.intel.com/en-us/intel-vtune-amplifier-xe

[2] http://www.bsc.es/computer-sciences/performance-tools/paraver

[3] http://www.cs.uoregon.edu/research/tau/home.php

[4] http://www.scalasca.org

Harald
  • `@Harald` Use of the `atomic` construct is giving better performance than the `critical` construct, but in both cases my serial program is running faster than either. Do you have any idea what could be the possible reason for this? – user1844 Sep 07 '16 at 12:55
  • WRT performance, that depends on multiple factors. I'll try to update my answer to cover that. – Harald Sep 07 '16 at 15:13
  • `@Harald` Thanks a lot. The performance has increased now: I've changed all the 2D arrays in the code to multiple 1D arrays (i.e. `a(i,j)` to `ax(i)`, `ay(i)` and `az(i)`). Now my parallel code is running faster than the serial one. But the reason remains unknown to me... – user1844 Sep 08 '16 at 06:01