
I have two versions of code that produce equivalent results, in which I am trying to parallelize only the inner loop of a nested for loop. I'm not getting much speedup, but I didn't expect a 1-to-1 improvement since I'm only parallelizing the inner loop.

My main question is: why do these two versions have similar runtimes? Doesn't the second version fork threads only once, avoiding the overhead of starting new threads on every iteration over i, as the first version does?

The first version of code starts up threads on every iteration of the outer loop like this:

int i, j, sum, final = 0;

for(i=0; i<2000000; i++){
  sum = 0;
  /* a new parallel region (and, naively, a new thread team) is
     created on every iteration of the outer loop */
  #pragma omp parallel for private(j) reduction(+:sum)
  for(j=0; j<1000; j++){
    sum += 1;
  }
  final += sum;
}
printf("final=%d\n",final/2000000);

With this output and runtime:

OMP_NUM_THREADS=1

final=1000
real    0m5.847s
user    0m5.628s
sys     0m0.212s

OMP_NUM_THREADS=4

final=1000
real    0m4.017s
user    0m15.612s
sys     0m0.336s

The second version of code starts threads once(?) before the outer loop and parallelizes the inner loop like this:

int i, j, sum = 0, final = 0;

#pragma omp parallel private(i,j)
for(i=0; i<2000000; i++){
  /* every thread executes this outer loop; only the inner loop's
     iterations are divided among the team */
  sum = 0;  /* sum is shared: all threads write 0 here */
  #pragma omp barrier
  #pragma omp for reduction(+:sum)
  for(j=0; j<1000; j++){
    sum += 1;
  }
  /* the omp for ends with an implicit barrier, so sum is complete here */
  #pragma omp single
  final += sum;  /* single also ends with an implicit barrier */
}
printf("final=%d\n",final/2000000);

With this output and runtime:

OMP_NUM_THREADS=1

final=1000
real    0m5.476s
user    0m4.964s
sys     0m0.504s

OMP_NUM_THREADS=4

final=1000
real    0m4.347s
user    0m15.984s
sys     0m1.204s

Why isn't the second version much faster than the first? Doesn't it avoid the overhead of starting threads on every iteration of the outer loop, or am I doing something wrong?
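One quick way to test the thread-reuse hypothesis is to print a per-thread identifier from several consecutive parallel regions; if the same IDs keep showing up, the runtime is pooling its threads. A minimal sketch (my own check, not part of the benchmark; casting `pthread_self()` to an integer for printing is a glibc/Linux assumption):

#include <stdio.h>
#include <pthread.h>
#include <omp.h>

int main(void){
  int r;
  for(r=0; r<3; r++){
    #pragma omp parallel
    {
      /* if each OpenMP thread reports the same pthread ID in every
         region, the implementation is reusing threads from a pool */
      printf("region %d: omp thread %d = pthread %lu\n",
             r, omp_get_thread_num(), (unsigned long)pthread_self());
    }
  }
  return 0;
}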

nick
  • Probably because your implementation of OpenMP has effectively translated your first version into your second version. Most implementations are smart enough to keep your threads alive if they see threads constantly being created inside an outer loop. – NoseKnowsAll Jun 06 '16 at 05:14
  • Your second version is better even with thread pools, see [here](https://stackoverflow.com/questions/31321071/openmp-nested-for-loop-becomes-faster-when-having-parallel-before-outer-loop/31382775#31382775). – Z boson Jun 06 '16 at 08:06
  • Oh, I'm surprised your second version is slower. Did you compile with optimization? I tried your code with optimization, and without OpenMP the inner loop is optimized to simply 1000, but with OpenMP it's not. You probably did not compile with optimization. – Z boson Jun 06 '16 at 08:44
  • No, I didn't want to compile with optimization for exactly that reason. The `sum += 1` should really be something like `sum += non_constant`. I guess the first version gets compiled into the second version, as suggested here. – nick Jun 06 '16 at 18:13
  • Well, I think you will find few people who actually care about results with one thread versus several threads without optimization enabled. You have to do your tests with optimization. It's often a challenge to get the compiler not to optimize something away, but that's what you have to do (one way is sketched after these comments), because if you tell people you did not use optimization for this reason, they won't be interested. – Z boson Jun 06 '16 at 19:58
  • Also, I highly doubt the first version gets converted to the second even with optimization. OpenMP implementations use a thread pool, so in many cases you probably won't notice a difference between the two, but that does not mean they are equivalent. – Z boson Jun 06 '16 at 19:59
  • Good points. Next time I test I'll run a more realistic case and use optimization. My actual problem is more like a random walk, with the outer loop being a time step and the inner loop being a bunch of walkers that I want to parallelize. In the end, I converted my code to MPI, and it's faster by a factor of the number of processes, so I'm happy with it. – nick Jun 06 '16 at 20:36
  • You should have linked to your question [here](http://stackoverflow.com/q/37622315/2542702). – Z boson Jun 07 '16 at 06:14
  • I assume that your inner loop is too trivial and has too little computation, so the parallelization overhead counteracts the concurrency benefits. Try a computationally heavy operation instead of `sum += 1`. – Badiboy Jun 12 '16 at 20:06
  • And one more: the second version is not equivalent to the first one. In the first version you'll have 2M outer loop iterations executed. In the second version you'll have 2M*NumThreads iterations executed, because EVERY thread executes the outer loop independently. – Badiboy Jun 12 '16 at 20:08
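Following up on the optimization discussion in the comments above, here is a minimal sketch of one way to keep the compiler from folding the inner loop into a constant: make the summand data-dependent. The array `a` and its random fill are illustrative assumptions, not from the original code:

#include <stdio.h>
#include <stdlib.h>

#define N 1000

int main(void){
  int i, j;
  long sum, final = 0;
  int *a = malloc(N * sizeof *a);

  for(j=0; j<N; j++) a[j] = rand() % 2;  /* values unknown at compile time */

  for(i=0; i<2000000; i++){
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for(j=0; j<N; j++){
      sum += a[j];  /* data-dependent: cannot be folded into a constant */
    }
    final += sum;
  }
  printf("final=%ld\n", final/2000000);
  free(a);
  return 0;
}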

2 Answers


An OpenMP implementation may use thread pooling to eliminate the overhead of starting threads on encountering a parallel construct. A pool of OMP_NUM_THREADS threads is started for the first parallel construct, and after the construct is completed the slave threads are returned to the pool. These idle threads can be reallocated when a later parallel construct is encountered.

See for example this explanation of thread pooling in the Sun Studio OpenMP implementation.
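Thread pooling reduces, but does not eliminate, the cost of entering a parallel region: the team still has to be woken up and joined every time, and the question's code pays that cost two million times. A minimal sketch (my own micro-benchmark, not from the linked documentation) that makes the per-region cost visible:

#include <stdio.h>
#include <omp.h>

int main(void){
  const int reps = 100000;
  int r;
  double t0 = omp_get_wtime();
  for(r=0; r<reps; r++){
    #pragma omp parallel
    {
      /* empty region: the only cost is waking and joining the team */
    }
  }
  double t1 = omp_get_wtime();
  printf("average per-region cost: %g microseconds\n",
         (t1 - t0)/reps*1e6);
  return 0;
}

Even at a few microseconds per region, two million regions add up to seconds of pure overhead, which is consistent with the timings in the question.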

Josh Milthorpe

You appear to be retracing the steps of Amdahl's Law, which relates the speedup of a parallel program to the overhead of the parallelism itself. One consequence of Amdahl's observation is that no matter how much parallelism you put into a program, its serial portion and its parallel overhead still have to be paid. Parallelism only starts to improve runtime when the program contains enough work to compensate for the extra processing cost.
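For reference, the usual statement of the law: if a fraction p of the runtime is parallelizable and N threads are used, then

speedup(N) = 1 / ((1 - p) + p/N)

With p = 0.95 and N = 4, for example, the bound is 1 / (0.05 + 0.95/4) ≈ 3.5. The numbers here are only illustrative, but they show why a tiny inner loop, whose per-region overhead effectively shrinks p, sees so little benefit.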

Dellowar