Effective parallelization of the inner loop

Question

I have this sequential code:

for (unsigned item = 0; item < totalItems; ++item) { // Outer loop
// Outer body
  for (unsigned j = 0; j < maxSize; ++j) { // Inner loop
  // Inner body
  }
}

My goal is to simply parallelize the inner loop. It could be done like this:

for (unsigned item = 0; item < totalItems; ++item) { // Outer loop
// Outer body
  #pragma omp parallel for
  for (unsigned j = 0; j < maxSize; ++j) { // Inner loop
  // Inner body
  }
}

The problem with this code is that new threads are spawned on every iteration of the outer loop. To speed it up, I want to create a team of threads in advance and reuse it multiple times. I found that the directive #pragma omp for exists for this purpose:

#pragma omp parallel
for (unsigned item = 0; item < totalItems; ++item) { // Outer loop
// Outer body
  #pragma omp for
  for (unsigned j = 0; j < maxSize; ++j) { // Inner loop
  // Inner body
  }
}

However, if I understand it correctly, the directive #pragma omp parallel causes the outer loop to be run multiple times. Is this correct?

Edit: Here is a more detailed example:

// Let's say that an image is represented as an array of pixels,
// where each pixel is just one integer.
std::vector<Image> images = getImages();

for (auto & image : images) { // Loop over all images
  #pragma omp parallel for
  for (unsigned j = 0; j < image.size(); ++j) { // Loop over each pixel
    image.at(j) += addMagicConstant(j);      
  }
}

Goal: I want to spawn a team of threads and then use them repeatedly to parallelize only the inner loop (= the loop over the image pixels).

Possible duplicate of [omp parallel vs. omp parallel for](https://stackoverflow.com/questions/1448318/omp-parallel-vs-omp-parallel-for) — OznOg, Oct 27 '18 at 08:03
See https://bisqwit.iki.fi/story/howto/openmp/: "The parallel construct starts a parallel block. It creates a team of N threads", so the threads remain available during the whole parallel section; they are "used" in the for loop. – OznOg, Oct 27 '18 at 08:06
Possible duplicate of [How does OpenMP handle nested loops?](https://stackoverflow.com/q/13357065/608639) and [Nested loops, inner loop parallelization, reusing threads](https://stackoverflow.com/q/27342408/608639). If I am parsing things correctly, only the outer loop needs OpenMP attribute of `#pragma omp parallel for collapse(2)`. — jww, Oct 27 '18 at 08:11
This is really overthinking it and underestimating OpenMP's powerful features. By default, OpenMP will create the thread pool, so there is no need to worry about that. – Mahmoud Fayez, Oct 27 '18 at 08:53
We cannot answer this without knowing anything about the *"inner body"* and *"outer body"*. Please prepare a [mcve] or at the very least clearly describe what is done in those regions. – Zulan, Oct 27 '18 at 09:15
what do you mean by "to parallelize only the inner loop"? Be more specific. What is parallelizing inner loop and how is it different from parallelizing outer loop. — user31264, Oct 27 '18 at 17:31
After you've edited your question my response is even more clearly the correct answer. — Vlatko Šurlan, Oct 29 '18 at 11:31

score 1 · Answer 1 · answered Oct 30 '18 at 14:39

Your code is perfectly valid and will indeed work:

#pragma omp parallel
for (unsigned item = 0; item < totalItems; ++item) { // Outer loop
// Outer body
  #pragma omp for
  for (unsigned j = 0; j < maxSize; ++j) { // Inner loop
  // Inner body
  }
}

#pragma omp parallel will spawn the threads. Each thread will then proceed through the outer loop. At each loop iteration, the threads will hit the #pragma omp for and the inner loop will be distributed among the threads. There is an implicit barrier at the end of each omp for block, so threads will wait until the inner loop has been completed before moving to the next outer loop iteration.
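To make the behaviour described above observable, here is a minimal sketch. It is illustrative only: RunCounts and runWithOneTeam are hypothetical names, not part of the question's code. It counts how often an unguarded outer-body statement runs (once per thread per iteration) and how often a statement guarded by #pragma omp single runs (once per iteration). The code is written so that it also compiles and behaves sensibly without OpenMP enabled, in which case the team size is 1.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper type, used only to report what happened.
struct RunCounts {
  int teamSize;       // number of threads in the team (1 without OpenMP)
  int outerBodyRuns;  // executions of the unguarded outer-body statement
  int singleRuns;     // executions of the statement guarded by `single`
};

RunCounts runWithOneTeam(int totalItems, std::vector<int>& data) {
  int teamSize = 1;
  int outerBodyRuns = 0;
  int singleRuns = 0;
  const int maxSize = static_cast<int>(data.size());

  #pragma omp parallel  // the team of threads is created once, here
  {
    #pragma omp single
    {
#ifdef _OPENMP
      teamSize = omp_get_num_threads();
#endif
    }

    // Every thread of the team executes this outer loop in full.
    for (int item = 0; item < totalItems; ++item) {
      // Unguarded outer body: runs once per thread per iteration.
      #pragma omp atomic
      ++outerBodyRuns;

      // Guarded outer body: runs on exactly one thread per iteration.
      // `single` ends with an implicit barrier.
      #pragma omp single
      {
        ++singleRuns;
      }

      // Work-sharing: the j iterations are divided among the existing team.
      #pragma omp for
      for (int j = 0; j < maxSize; ++j) {
        data[j] += item;
      }
      // Implicit barrier: all threads finish the inner loop before moving on.
    }
  }
  return RunCounts{teamSize, outerBodyRuns, singleRuns};
}
```

With a team of N threads, outerBodyRuns ends up as N * totalItems while singleRuns equals totalItems. So if the outer body must run only once per item, it has to be guarded (for example with single) or be safe to repeat on every thread.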

An omp for work-shared loop may be placed inside another for or while loop, or inside a conditional section, provided that either all threads of the team reach it or none do, and that they all reach it the same number of times.
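The image example from the question meets this requirement, because every thread iterates over the same vector of images. A minimal sketch, assuming an Image is simply a std::vector<int> and using a made-up body for addMagicConstant (the question defines neither):

```cpp
#include <cstddef>
#include <vector>

// Assumption: one int per pixel, as the question's comment suggests.
using Image = std::vector<int>;

// Hypothetical stand-in for the question's addMagicConstant().
static int addMagicConstant(std::size_t j) {
  return static_cast<int>(j % 7);
}

void processImages(std::vector<Image>& images) {
  #pragma omp parallel  // one team for all images
  {
    // Each thread walks over all images; the bound is the same for every
    // thread, so the whole team reaches the omp for the same number of times.
    for (std::size_t i = 0; i < images.size(); ++i) {
      Image& image = images[i];
      const long pixelCount = static_cast<long>(image.size());

      // Only the pixel loop is split among the threads.
      #pragma omp for
      for (long j = 0; j < pixelCount; ++j) {
        image[j] += addMagicConstant(static_cast<std::size_t>(j));
      }
      // Implicit barrier before the next image is started.
    }
  }
}
```

A signed loop index is used for the work-shared loop because some compilers only accept signed indices there; that is a portability choice, not something the question requires.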

It is forbidden, however, to use constructs such as:

#pragma omp parallel
for (unsigned ii = 0; ii < omp_get_thread_num(); ++ii) { // Number of outer-loop iterations depends on the thread
// Outer body
  #pragma omp for
  for (unsigned j = 0; j < maxSize; ++j) { // Inner loop
  // Inner body
  }
}

or

#pragma omp parallel
if(condition_depending_on_thread_num) { 
  #pragma omp for
  for (unsigned j = 0; j < maxSize; ++j) { // Loop
  // Inner body
  }
}
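For contrast, here is a sketch of a conditional that is allowed, because the condition evaluates to the same value on every thread. The function name doubleIfEnabled and its parameters are made up for illustration:

```cpp
#include <vector>

void doubleIfEnabled(std::vector<int>& data, bool enabled) {
  const long n = static_cast<long>(data.size());

  #pragma omp parallel
  {
    // `enabled` is shared and not modified inside the region, so either the
    // whole team reaches the omp for below or no thread does.
    if (enabled) {
      #pragma omp for
      for (long j = 0; j < n; ++j) {
        data[j] *= 2;
      }
    }
  }
}
```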

score -1 · Answer 2 · answered Oct 27 '18 at 08:43

Did you try:

#pragma omp parallel for
for (unsigned item = 0; item < totalItems; ++item) { // Outer loop
  for (unsigned j = 0; j < maxSize; ++j) { // Inner loop
  }
}

Vlatko Šurlan


This will also parallelize the outer loop, and that isn't what I want. – John Carbon Oct 27 '18 at 17:04
That is exactly what you want. You want each thread to work on a single item. That way you will avoid cache miss orgy and get the maximum performance. – Vlatko Šurlan Oct 29 '18 at 09:39
