I'm reading the book *An Introduction to Parallel Programming* by Peter S. Pacheco. Section 5.6.2 gives an interesting discussion about reducing the fork/join overhead. Consider the odd-even transposition sort algorithm:
```c
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
        #pragma omp parallel for default(none) shared(n) private(i)
        for (i = 1; i < n; i += 2) {
            // meat: compare and possibly swap a[i-1] and a[i]
        }
    } else {
        #pragma omp parallel for default(none) shared(n) private(i)
        for (i = 1; i < n-1; i += 2) {
            // meat: compare and possibly swap a[i] and a[i+1]
        }
    }
}
```
The author argues that the above code has fairly high fork/join overhead, because the threads are forked and joined in every iteration of the outer loop. Hence, he proposes the following version:
```c
#pragma omp parallel default(none) shared(n) private(i, phase)
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
        #pragma omp for
        for (i = 1; i < n; i += 2) {
            // meat: compare and possibly swap a[i-1] and a[i]
        }
    } else {
        #pragma omp for
        for (i = 1; i < n-1; i += 2) {
            // meat: compare and possibly swap a[i] and a[i+1]
        }
    }
}
```
According to the author, the second version forks the threads once, before the outer loop starts, and reuses them for every iteration, which should yield better performance.
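For concreteness, here is how I read the second version as a complete program. The array `a`, the temporary `tmp`, the `num_threads(4)` clause, and the compare-and-swap bodies are my own fill-ins for the `//meat`, based on how odd-even transposition sort is usually written:

```c
#include <stdio.h>

#define N 10

int main(void) {
    int a[N] = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
    int i, phase, tmp, n = N;

    // One parallel region around the whole sort: the team is forked
    // once, and the implicit barrier at the end of each omp for
    // keeps the phases in order.
    #pragma omp parallel num_threads(4) default(none) \
            shared(a, n) private(i, phase, tmp)
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp for
            for (i = 1; i < n; i += 2)
                if (a[i-1] > a[i]) { tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp; }
        } else {
            #pragma omp for
            for (i = 1; i < n-1; i += 2)
                if (a[i] > a[i+1]) { tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp; }
        }
    }

    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```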
However, I'm suspicious of the correctness of the second version. In my understanding, a `#pragma omp parallel` directive spawns a team of threads and lets each thread execute the following structured block in parallel. In this case, the structured block should be the whole outer for-loop `for(phase=0 ...)`. Then shouldn't the whole outer loop be executed four times when 4 threads are used? That is, if `n=10`, wouldn't 40 iterations of the outer loop be executed across the 4 threads? What is wrong with my understanding? And how does `omp parallel` (without `for`) interact with a for-loop that follows it, like the one above?
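To make the question concrete, here is a small probe I would use to test my reading (the thread count and printouts are my own; the comments record what I believe the OpenMP spec says should happen):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, n = 10;

    // A bare parallel region: each thread executes the whole loop,
    // so with 4 threads this should print 4 * 10 = 40 lines.
    #pragma omp parallel num_threads(4) private(i)
    for (i = 0; i < n; i++)
        printf("replicated: thread %d ran i = %d\n", omp_get_thread_num(), i);

    // A parallel region containing a work-sharing for: the 10
    // iterations are divided among the threads, so this should
    // print only 10 lines in total.
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            printf("shared: thread %d ran i = %d\n", omp_get_thread_num(), i);
    }
    return 0;
}
```

If my reading of the second version in the book were right, both regions would behave like the first one, which is exactly what I find confusing.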