
I want to benchmark some Fortran code that uses OpenMP threads and a critical section. To simulate a realistic environment, I tried to generate some load before the critical section.

! Compiler invocation: gfortran -fopenmp -o minExample.x minExample.f90

  PROGRAM minExample
     USE omp_lib
     IMPLICIT NONE
     INTEGER                        :: n_chars, real_alloced
     INTEGER                        :: nx,ny,nz,ix,iy,iz, idx
     INTEGER                        :: nthreads, lasteinstellung,i 
     INTEGER, PARAMETER             :: dp = kind(1.0d0)
     REAL (KIND = dp)               :: j
     CHARACTER(LEN=32)              :: arg

     nx             = 2
     ny             = 2
     nz             = 2
     lasteinstellung= 10000
     CALL getarg(1, arg)
     READ(arg,*) nthreads
     CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
     nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
     WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"

    n_chars = 0
    idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp   shared(nx,ny,nz,n_chars) &
!$omp   private(ix,iy,iz, idx) &
!$omp   private(lasteinstellung,j) !&  
    DO iz=-nz,nz
       DO iy=-ny,ny
          DO ix=-nx,nx
!                  WRITE(*,*) ix,iy,iz
             j = 0.0d0
             DO i=1,lasteinstellung
                j = j + real(i)
             END DO
!$omp critical
             n_chars = n_chars + 1
             idx = n_chars
!$omp end critical
          END DO
       END DO
    END DO
  END PROGRAM

I compiled this code with gfortran -fopenmp -o test.x test.f90 and ran it with time ./test.x THREAD, where THREAD is the thread count passed to OMP_SET_NUM_THREADS. The run time behaves strangely depending on the thread count: compared with one thread (6 ms), execution with more threads takes far longer (2 threads: 16000 ms, 4 threads: 9000 ms) on my multicore machine. What could cause this behaviour? Is there a better (but still easy) way to generate load without running into cache effects or related issues?

Edit: more strange behaviour. With the WRITE statement enabled in the nested loops, execution speeds up dramatically with 2 threads. With it commented out, execution with 2 or 3 threads takes forever (a WRITE shows the loop variables incrementing very slowly), but not with 1 or 4 threads. I also tried this code on another multicore machine. There it takes forever with 1 and 3 threads, but not with 2 or 4 threads.

Jannek S.
  • Welcome to StackOverflow. If you have a problem with code, you should create a Minimal, Complete, and Verifiable example (http://stackoverflow.com/help/mcve) that we can try to run ourselves. – Vladimir F Героям слава Jul 15 '16 at 12:45
  • Does this "other machine" have hyperthreading? [Threads running on two logical cores of the same physical CPU core can communicate much more quickly than on separate physical cores](http://stackoverflow.com/questions/32979067/what-will-be-used-for-data-exchange-between-threads-are-executing-on-one-core-wi/32981256). I don't know Fortran, so I didn't even try to skim the code to see if this might be part of the explanation. – Peter Cordes Jul 17 '16 at 12:55

1 Answer


If the code you are showing is really complete, you are missing a definition of lasteinstellung inside the parallel region, where it is declared private. A private variable does not inherit the value it had before the region, so it is undefined there, and the loop

                 DO i=1,lasteinstellung
                    j = j + real(i)
                 END DO

can take a completely arbitrary number of iterations.

If the value is defined before the parallel region, you probably want firstprivate instead of private.
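As a minimal sketch of that suggestion, the loop nest from the question can be written with the load setting (the question's variable lasteinstellung) moved from private to firstprivate. The program name firstprivate_sketch is made up for illustration, the thread-count handling from the question is left out, and everything else follows the question's code. Because the variable is only read inside the loop, listing it as shared should work as well.

```fortran
! Sketch only: the question's loop nest with the load setting made
! firstprivate, so every thread starts with the value assigned before
! the parallel region. The program name is hypothetical.
program firstprivate_sketch
   use omp_lib
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer            :: nx, ny, nz, ix, iy, iz, i, idx
   integer            :: n_chars, lasteinstellung
   real(kind=dp)      :: j

   nx = 2
   ny = 2
   nz = 2
   lasteinstellung = 10000
   n_chars = 0

!$omp parallel do default(none) collapse(3) &
!$omp   shared(nx,ny,nz,n_chars) &
!$omp   private(ix,iy,iz,idx,j) &
!$omp   firstprivate(lasteinstellung)
   do iz = -nz, nz
      do iy = -ny, ny
         do ix = -nx, nx
            ! Artificial load: the trip count is now well defined
            j = 0.0_dp
            do i = 1, lasteinstellung
               j = j + real(i, dp)
            end do
!$omp critical
            n_chars = n_chars + 1
            idx = n_chars
!$omp end critical
         end do
      end do
   end do
!$omp end parallel do

   ! 5*5*5 iterations, each incrementing the counter once inside the
   ! critical section, so n_chars should be 125 for any thread count
   write(*,*) "n_chars = ", n_chars
end program firstprivate_sketch
```

It can be built the same way as in the question, for example with gfortran -fopenmp, and the thread count can be set through the OMP_NUM_THREADS environment variable.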