
Once the program has executed a parallel section, all the threads except for the master get destroyed. As the program enters another parallel section, threads are spawned again. All of this causes some overhead, which may significantly limit the efficiency of multiprocessing.

Consider a nested loop with the inner one being parallelized:

for (int i = 0; i < outer_loops; i++) {
    // some code which does not affect the execution time noticeably
#pragma omp parallel for
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}

Is there a way to prevent the slave threads from being spawned anew each time the program enters the inner loop? For example, once the inner loop is done, the master thread takes care of the outer loop, but the slave threads do not get destroyed only to be spawned again; they just wait for the next iteration of the inner loop.

In my case, there is no way I can include the outer loop in the parallel section, as it must be sequential. I'm trying to minimize the overhead caused by thread spawning.

Kaiyakha
  • You can use the `barrier`, `nowait`, or `master` pragma in the inner loop, according to your algorithm. "The omp barrier directive identifies a synchronization point at which threads in a parallel region will wait until all other threads in that section reach the same point. Statement execution past the omp barrier point then continues in parallel." See [IBM DOC](https://www.ibm.com/docs/en/xl-c-aix/12.1.0?topic=processing-pragma-omp-barrier) – Alimpk Apr 08 '22 at 23:00
  • "I'm trying to minimize the overhead caused by thread spawn." I have nothing to add to the answer already given, except to ask: do you have any evidence that there is an actual problem? – Victor Eijkhout Apr 09 '22 at 15:59
  • @VictorEijkhout yes, here https://stackoverflow.com/questions/71788162/openmp-thread-spawn-overhead – Kaiyakha Apr 09 '22 at 16:02
  • I don't believe those tests. The comment by @jerome is right. – Victor Eijkhout Apr 09 '22 at 16:38

2 Answers

4

Once the program has executed a parallel section, all the threads except for the master get destroyed. As the program enters another parallel section, threads are spawned again.

This is wrong in practice. An OpenMP implementation can keep threads alive even outside a parallel section and recycle them. In fact, GOMP (the OpenMP runtime of GCC) and IOMP (the OpenMP runtime of Clang and ICC) do this optimization, as long as some constraints are fulfilled (such as the different sections requesting the same number of threads).
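
For instance, here is a minimal sketch of how to keep the team size constant so that the runtime can recycle its thread pool (the loop count is illustrative):

#include <omp.h>

int main(void) {
    // Forbid the runtime from adjusting the team size dynamically, so that
    // every parallel region requests the same number of threads and the
    // existing pool can be recycled instead of being torn down and respawned.
    omp_set_dynamic(0);
    omp_set_num_threads(omp_get_max_threads());

    for (int k = 0; k < 1000; k++) {
        #pragma omp parallel
        {
            // same team size every time -> the worker threads are reused
        }
    }
    return 0;
}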

All of this causes some overhead, which may significantly limit the efficiency of multiprocessing.

Most of the overhead does not come from the creation of the threads but from the initialisation of some internal data structures, the wake-up of the threads (if needed, depending on your settings), their synchronisation, and the actual worksharing operation. With an active waiting policy (see OMP_WAIT_POLICY), the initialisation of the internal data structures should be the most expensive part, though it is generally pretty fast.
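
To get a rough idea of the magnitude of these overheads on a given machine, one can time an empty parallel region. A minimal sketch (the repetition count is arbitrary):

#include <omp.h>
#include <stdio.h>

int main(void) {
    const int reps = 100000;
    double t = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        {
            // empty region: only the wake-up, synchronisation and
            // region-management costs are measured
        }
    }
    t = omp_get_wtime() - t;
    printf("average overhead per region: %g us\n", 1e6 * t / reps);
    return 0;
}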

Is there a way to prevent the slave threads from being spawned anew each time the program enters the inner loop?

You can use one parallel section and tell OpenMP that a part is to be executed only by the master thread:

#pragma omp parallel
for (int i = 0; i < outer_loops; i++) {
    #pragma omp master
    {
        // some code which does not affect the execution time noticeably
    }

    // This is certainly required in your case so that workers can 
    // read the results of the master thread.
    #pragma omp barrier

    // Worksharing construct with an implicit barrier at the end
    #pragma omp for
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}

The two barriers (the explicit one and the implicit one at the end of the worksharing construct) slow down the execution, but they are mandatory in your case. If the sequential code of the outer loop does not depend on the result of the inner parallel loop, then you can use OpenMP tasks to speed up the computation (they reduce the need for expensive synchronizations and provide more parallelism).
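
For illustration, here is a sketch of such a task-based variant, assuming the sequential part of an iteration does not read the results of the previous inner loop (`taskloop` requires OpenMP 4.5 or later):

#pragma omp parallel
#pragma omp master
for (int i = 0; i < outer_loops; i++) {
    // sequential code, executed by the master thread only

    // nogroup: do not wait for these tasks here, so the next iteration's
    // sequential code can overlap with the inner-loop work executed by
    // the other threads
    #pragma omp taskloop nogroup
    for (int j = 0; j < inner_loops; j++) {
        // some code to run through with multiple threads
    }
}
// the implicit barrier at the end of the parallel region waits for the
// remaining tasks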

Note that this code is fundamentally limited by the speed of the sequential code: Amdahl's law states that this strongly limits the scalability of the approach. Synchronization and worksharing overheads tend to slow down the serial part, further reducing scalability. In the worst case, the parallel code can even be slower than the sequential one. The only solutions are to reduce the time taken by the serial part or to work on more data (see Gustafson's law).
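
For reference, if a fraction s of the runtime is serial and p threads are used, Amdahl's law bounds the speedup by

    speedup(p) = 1 / (s + (1 - s) / p) <= 1 / s

so, for example, a serial part taking 10% of the time caps the achievable speedup at 10, no matter how many threads are used.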

Jérôme Richard
  • with `pragma omp parallel` outside the loop and `pragma omp master` inside the loop I guess there will be many parallel outer loops running independently, won't they? – Kaiyakha Apr 08 '22 at 23:58
  • Each thread executes the `for (int i = 0; i < outer_loops; i++)` loop (all iterations), but only one executes the content of the master section (the others just wait until it is done). So, yes. – Jérôme Richard Apr 09 '22 at 00:27
  • @Kaiyakha Note that instead of `#pragma omp master` and `#pragma omp barrier` alternatively you can use `#pragma omp single`, which has an implicit barrier at the end and it is also executed by one thread only. – Laci Apr 09 '22 at 09:18
  • @JérômeRichard so, as the other threads are already busy with the outer loop (even if it's empty), there will be only one thread available after the barrier, won't there? – Kaiyakha Apr 09 '22 at 10:27
  • 1
    @Laci Indeed, but `single` can cause some issue with sequential codes here: the thread executing the sequential part will not always be the same (between two program executions and iterations) and some API can use thread-local storage. This can cause some API to make several thread-local storage initialisations or even crash because they assume that the initialisation is not required. I think a good example of that is the execution of MPI collectives with `MPI_THREAD_SINGLE`. This directive may also cause cache-line bouncing and conflict with frequency scaling here. – Jérôme Richard Apr 09 '22 at 11:00
  • @Kaiyakha I am not sure what you mean by "available after barrier". All threads have to wait before all others reach the barrier (for each iteration). In practice, before the barrier 1 master thread will execute the sequential code and all other will be waiting for this unique thread to reach the barrier. And after the barrier, all threads should perform the inner loop in parallel. The master thread cannot start a new iteration if all other threads did not complete the inner (parallel) loop. – Jérôme Richard Apr 09 '22 at 11:06
1

The code in your other question had a number of obvious problems, so I coded it out. I ran the exact same code 10 thousand times, with a variable number of iterations outside and inside the parallel region:

  // (x and y are double arrays of length N; N is chosen so that the
  //  arrays take about a megabyte)
  const int n_iterations = 10000;
  for ( int outer=1; outer<=n_iterations; outer*=10 ) {
#   pragma omp parallel for
    for (int i=0; i<N; i++)
      y[i] = x[i] = 0.;
    x[0] = 0; x[N-1] = 1.;

    double s=0; int its=0;
    double timer = omp_get_wtime();
    int n_inner = n_iterations/outer;
    for ( int it_outer=0; it_outer<outer; it_outer++ ) {
#     pragma omp parallel
      {
        for ( int it_inner=0; it_inner<n_inner; it_inner++ ) {
#         pragma omp master
          its++;
#         pragma omp for
          for (int i=1; i<N-1; i++)
            y[i] = ( x[i-1]+x[i]+x[i+1] )/3.;
#         pragma omp for
          for (int i=1; i<N-1; i++)
            x[i] = y[i];
        } // end inner iteration
      } // end omp parallel
    } // end outer iteration
#   pragma omp parallel for reduction(+:s)
    for (int i=0; i<N; i++)
      s += y[i];
    timer = omp_get_wtime()-timer;
    printf("outer=%4d, n_inner=%4d, its=%4d, time=%7.4f, checksum=%e\n",
           outer,n_inner,its,timer,s);
  }

Results:

outer=   1, n_inner=10000, its=10000, time= 0.0766, checksum=6.464904e+01
outer=  10, n_inner=1000, its=10000, time= 0.0777, checksum=6.464904e+01
outer= 100, n_inner= 100, its=10000, time= 0.0776, checksum=6.464904e+01
outer=1000, n_inner=  10, its=10000, time= 0.0807, checksum=6.464904e+01
outer=10000, n_inner=   1, its=10000, time= 0.1142, checksum=6.464904e+01

Timing depends a little on the array size; here I used about a megabyte. You see that the thread overhead is very small: doing the thread "creation" ten thousand times increases the total time by 50 percent; doing it only a thousand times, by just 5 percent.

Victor Eijkhout