Code takes much more time to finish with more than 1 thread

Question

I want to benchmark some Fortran code that uses OpenMP threads and a critical section. To simulate a realistic environment, I tried to generate some load before the critical section.

! Compiler invocation: gfortran -fopenmp -o minExample.x minExample.f90

  PROGRAM minExample
     USE omp_lib
     IMPLICIT NONE
     INTEGER                        :: n_chars, real_alloced
     INTEGER                        :: nx,ny,nz,ix,iy,iz, idx
     INTEGER                        :: nthreads, lasteinstellung,i 
     INTEGER, PARAMETER             :: dp = kind(1.0d0)
     REAL (KIND = dp)               :: j
     CHARACTER(LEN=32)              :: arg

     nx             = 2
     ny             = 2
     nz             = 2
     lasteinstellung= 10000
     CALL getarg(1, arg)
     READ(arg,*) nthreads
     CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel
!$omp master
     nthreads=omp_get_num_threads()
!$omp end master
!$omp end parallel
     WRITE(*,*) "Running OpenMP benchmark on ",nthreads," thread(s)"

    n_chars = 0
    idx = 0
!$omp parallel do default(none) collapse(3) &
!$omp   shared(nx,ny,nz,n_chars) &
!$omp   private(ix,iy,iz, idx) &
!$omp   private(lasteinstellung,j) !&  
    DO iz=-nz,nz
       DO iy=-ny,ny
          DO ix=-nx,nx
!                  WRITE(*,*) ix,iy,iz
             j = 0.0d0
             DO i=1,lasteinstellung
                j = j + real(i)
             END DO
!$omp critical
             n_chars = n_chars + 1
             idx = n_chars
!$omp end critical
          END DO
       END DO
    END DO
  END PROGRAM

I compiled this code with gfortran -fopenmp -o test.x test.f90 and executed it with time ./test.x THREAD. The run time behaves strangely depending on the thread count (set with OMP_SET_NUM_THREADS): compared with one thread (6 ms), execution with more threads takes far longer (2 threads: 16000 ms, 4 threads: 9000 ms) on my multicore machine. What could cause this behaviour? Is there a better (but still easy) way to generate load without running into cache effects or related issues?

Edit: more strange behaviour. If the WRITE statement in the nested loops is active, execution speeds up dramatically with 2 threads. If it is commented out, execution with 2 or 3 threads takes forever (the WRITE output shows the loop variables incrementing very slowly), but not with 1 or 4 threads. I also tried this code on another multicore machine. There it takes forever with 1 and 3 threads, but not with 2 or 4 threads.

Welcome to Stack Overflow. If you have a problem with your code, you should create a Minimal, Complete, and Verifiable example (http://stackoverflow.com/help/mcve) that we can try to run ourselves. – Vladimir F Героям слава, Jul 15 '16 at 12:45
Does this "other machine" have hyperthreading? [Threads running on two logical cores of the same physical CPU core can communicate much more quickly than on separate physical cores](http://stackoverflow.com/questions/32979067/what-will-be-used-for-data-exchange-between-threads-are-executing-on-one-core-wi/32981256). I don't know Fortran, so I didn't even try to skim the code to see if this might be part of the explanation. — Peter Cordes, Jul 17 '16 at 12:55

Vladimir F Героям слава · Accepted Answer · 2016-07-15T13:09:27.630

2

If the code you are showing is really complete, you are missing a definition of loadSet (apparently the variable named lasteinstellung in the listing above) inside the parallel section, where it is private. Its value there is undefined, and the loop

                 DO i=1,loadSet
                    j = j + real(i)
                 END DO

can take a completely arbitrary number of iterations.

If the value is defined somewhere earlier, in code you do not show, you probably want firstprivate instead of private.
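As a rough illustration of that suggestion, here is a minimal sketch of the loop nest from the question with the load counter moved from private to firstprivate. It assumes the variable is the one called lasteinstellung in the posted listing and that it is set to 10000 before the parallel region. The program name and the final WRITE are additions for this sketch only.

```fortran
! Sketch only: the loop nest from the question, with the load counter
! passed into the parallel region via firstprivate instead of private.
PROGRAM firstprivate_sketch
   IMPLICIT NONE
   INTEGER, PARAMETER :: dp = kind(1.0d0)
   INTEGER            :: nx, ny, nz, ix, iy, iz, i
   INTEGER            :: n_chars, idx, lasteinstellung
   REAL(KIND=dp)      :: j

   nx = 2
   ny = 2
   nz = 2
   lasteinstellung = 10000      ! defined before the parallel region
   n_chars = 0

   ! firstprivate gives each thread its own copy, initialised with the
   ! value from before the region. A plain private copy starts undefined.
!$omp parallel do default(none) collapse(3) &
!$omp   shared(nx,ny,nz,n_chars) &
!$omp   private(ix,iy,iz,idx,i,j) &
!$omp   firstprivate(lasteinstellung)
   DO iz = -nz, nz
      DO iy = -ny, ny
         DO ix = -nx, nx
            ! Artificial load before the critical section
            j = 0.0_dp
            DO i = 1, lasteinstellung
               j = j + real(i, dp)
            END DO
!$omp critical
            n_chars = n_chars + 1
            idx = n_chars
!$omp end critical
         END DO
      END DO
   END DO
!$omp end parallel do

   ! Addition for this sketch: report how often the critical section ran
   WRITE(*,*) "critical section entered ", n_chars, " times"
END PROGRAM firstprivate_sketch
```

Compiled as in the question (gfortran -fopenmp), the counter should reach 125, since the collapsed loops cover 5 x 5 x 5 iterations regardless of the thread count. Because the load counter is only read inside the region, listing it as shared instead of firstprivate should work equally well here.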

edited Jul 15 '16 at 13:09

answered Jul 15 '16 at 12:55

Vladimir F Героям слава


The variable was defined. The problem was that I had declared it only as private. Thanks! – Jannek S. Jul 17 '16 at 08:10
