I'm reading the book *An Introduction to Parallel Programming* by Peter S. Pacheco. Section 5.6.2 gives an interesting discussion about reducing the fork/join overhead. Consider the odd-even transposition sort algorithm:
```c
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
        #pragma omp parallel for default(none) shared(n) private(i)
        for (i = 1; i < n; i += 2) {
            // meat: compare and possibly swap a[i-1] and a[i]
        }
    } else {
        #pragma omp parallel for default(none) shared(n) private(i)
        for (i = 1; i < n-1; i += 2) {
            // meat: compare and possibly swap a[i] and a[i+1]
        }
    }
}
```
The author argues that the above code has fairly high fork/join overhead, because the threads are forked and joined in every iteration of the outer loop. Hence, he proposes the following version:
```c
#pragma omp parallel default(none) shared(n) private(i, phase)
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
        #pragma omp for
        for (i = 1; i < n; i += 2) {
            // meat: compare and possibly swap a[i-1] and a[i]
        }
    } else {
        #pragma omp for
        for (i = 1; i < n-1; i += 2) {
            // meat: compare and possibly swap a[i] and a[i+1]
        }
    }
}
```
According to the author, the second version forks the threads once, before the outer loop starts, and reuses them for every iteration, which should yield better performance.
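For concreteness, here is how I read the second version as a complete program. The array `a`, the temporary `tmp`, the `num_threads(4)` clause, and the compare-and-swap bodies are my own fill-ins for the `//meat`, based on how odd-even transposition sort is usually written:

```c
#include <stdio.h>

#define N 10

int main(void) {
    int a[N] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
    int i, phase, tmp, n = N;

    // One parallel region around the whole sort: the team is forked
    // once, and the implicit barrier at the end of each omp for
    // keeps the phases in order.
    #pragma omp parallel num_threads(4) default(none) \
            shared(a, n) private(i, phase, tmp)
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp for
            for (i = 1; i < n; i += 2)
                if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
        } else {
            #pragma omp for
            for (i = 1; i < n-1; i += 2)
                if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
        }
    }

    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```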
However, I'm suspicious of the correctness of the second version. In my understanding, a `#pragma omp parallel` directive spawns a team of threads and lets each thread execute the following structured block in parallel. In this case, the structured block should be the whole outer for-loop `for(phase=0 ...)`. Then shouldn't the whole outer loop be executed four times when 4 threads are used? That is, if `n=10`, wouldn't 40 iterations of the outer loop be executed across the 4 threads? What is wrong with my understanding? And how does `omp parallel` (without `for`) interact with a for-loop that follows it, like the one above?
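To make the question concrete, here is a small probe I would use to test my reading (the thread count and printouts are my own; the comments record what I believe the OpenMP spec says should happen):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, n = 10;

    // A bare parallel region: each thread executes the whole loop,
    // so with 4 threads this should print 4 * 10 = 40 lines.
    #pragma omp parallel num_threads(4) private(i)
    for (i = 0; i < n; i++)
        printf("replicated: thread %d ran i = %d\n", omp_get_thread_num(), i);

    // A parallel region containing a work-sharing for: the 10
    // iterations are divided among the threads, so this should
    // print only 10 lines in total.
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            printf("shared: thread %d ran i = %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
```

If my reading of the second version in the book were right, both regions would behave like the first one, which is exactly what I find confusing.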