
I have a function f that can use parallel processing. For this purpose, I used OpenMP. However, this function is called many times, and it seems that threads are created on every call.

How can I reuse the threads?

void f(X &src, Y &dest) {
   ... // do processing based on "src"
   #pragma omp parallel for
   for (...) { 

   }
   ... // put output into "dest"
}

int main() {
    ...
    for(...) { // This loop itself cannot be parallelized.
       f(...);
    }
    ...
    return 0;
}
user9414424
  • Use a thread pool. – Jesper Juhl May 03 '18 at 16:22
  • How can I do that in this code? – user9414424 May 03 '18 at 16:26
  • 2
  • That's a fairly broad question. In *general* you'd create a number of threads up-front, then assign work to them as it becomes available (if there is an available thread in the pool; otherwise you'd queue up the work). When a thread finishes its work, it goes back into the pool and waits for more work to be assigned - see the sketch after these comments. – Jesper Juhl May 03 '18 at 16:30
  • Thank you so much. Does your suggestion mean that using OpenMP is a bad solution here? – user9414424 May 03 '18 at 17:07
  • The usage model with repeated parallel computation is a good fit for OpenMP and is a pretty common pattern in practice. – Anton May 03 '18 at 18:37
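
To illustrate the general idea from the comment above, here is a minimal thread-pool sketch in C++. The ThreadPool name and interface are made up for illustration and are not tied to any particular library; with OpenMP you would normally never write this yourself.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread pool: threads are created once and reused for every task.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }
    void submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run the task outside the lock, then wait for more work
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

// Usage: ThreadPool pool(4); pool.submit([]{ /* work */ });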

2 Answers


OpenMP implements a thread pool internally; it tries to reuse threads unless you change some of its settings in between, or use different application threads to call parallel regions while others are still active.

One can verify that the threads are indeed the same by using thread-locals, so I'd recommend that you verify your claim about the threads being recreated. The OpenMP runtime does lots of smart optimizations beyond the obvious thread-pool idea; you just need to know how to tune and control it properly.
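
For example, here is a minimal sketch of such a check (the variable names are made up): it counts how many distinct OS threads ever initialize a thread_local across many parallel regions. With thread reuse, the count stays near the team size instead of growing with the number of calls.

#include <atomic>
#include <cstdio>
#include <omp.h>

std::atomic<int> distinct_threads{0};

// The dynamic initializer runs exactly once per OS thread,
// on the first use of the variable in that thread.
thread_local int tl_id = distinct_threads.fetch_add(1);

int main() {
    for (int call = 0; call < 1000; ++call) {
        #pragma omp parallel
        {
            volatile int touch = tl_id; // force per-thread initialization
            (void)touch;
        }
    }
    // With thread reuse this prints roughly the team size,
    // not 1000 * omp_get_max_threads().
    std::printf("distinct threads seen: %d (team size %d)\n",
                distinct_threads.load(), omp_get_max_threads());
    return 0;
}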

While it is unlikely that the threads are recreated, it is easy to see how they can go to sleep by the time you call the parallel region again, and it then takes a noticeable amount of time to wake them up. You can prevent threads from going to sleep by setting OMP_WAIT_POLICY=active and/or implementation-specific environment variables like KMP_BLOCKTIME=infinite (for the Intel/LLVM runtimes).
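
If you want to measure this effect rather than guess, a simple sketch is to time many empty parallel regions (the repetition count here is arbitrary) and compare a run with the default settings against one started as OMP_WAIT_POLICY=active ./a.out:

#include <cstdio>
#include <omp.h>

int main() {
    const int reps = 100000;
    double t0 = omp_get_wtime();
    for (int i = 0; i < reps; ++i) {
        #pragma omp parallel
        {} // empty region: measures only fork/join and wake-up cost
    }
    double t1 = omp_get_wtime();
    std::printf("avg region overhead: %.3f us\n", (t1 - t0) / reps * 1e6);
    return 0;
}

A large difference between the two runs suggests the threads were sleeping between regions.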

Anton
  • In [this question](https://stackoverflow.com/q/22479258/2542702) I implemented the parallel region inside a loop, and I later found that the performance was a bit better if I moved the parallel region to the outer loop. I wonder if the performance would have been the same had I tried `OMP_WAIT_POLICY=active`. – Z boson May 04 '18 at 07:43

This is just an addition to Anton's correct answer. If you are really concerned about the issue, for most programs you can easily move the parallel region to the outside and keep the serial work serial, as follows:

void f(X &src, Y &dest) {
   // You can also do simple computations
   // without side effects outside of the single section
   #pragma omp single
   {
      ... // do processing based on "src"
   }
   #pragma omp for // note: "parallel" is missing
   for (...) {

   }
   #pragma omp critical
   ... // each thread puts its own part of the output into "dest"
}

int main() {
    ...
    // make sure to declare the loop variable locally or make it explicitly private
    #pragma omp parallel
    for(type variable;...;...) {
       f(...);
    }
    ...
    return 0;
}

Use this only if you have measured evidence that you are suffering from the overhead of reopening parallel regions. You may have to juggle shared variables, or manually inline f, because all variables declared within f will be private - how it looks in detail depends on your specific application.
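
For concreteness, here is a hypothetical, self-contained version of this pattern with made-up types (a vector of doubles, squared element-wise). Because each thread writes a disjoint part of dest, the critical section from the skeleton above is not needed in this particular variant:

#include <cstdio>
#include <vector>
#include <omp.h>

void f(const std::vector<double> &src, std::vector<double> &dest) {
    // One thread prepares the output buffer; the implicit barrier at the
    // end of "single" makes it visible to the whole team.
    #pragma omp single
    {
        dest.assign(src.size(), 0.0);
    }
    // Orphaned worksharing: the iterations are split among the enclosing team.
    #pragma omp for
    for (long i = 0; i < (long)src.size(); ++i) {
        dest[i] = src[i] * src[i];
    }
    // The implicit barrier at the end of "for" keeps "dest" consistent.
}

int main() {
    std::vector<double> src(1 << 20, 2.0), dest;
    #pragma omp parallel
    for (int call = 0; call < 100; ++call) { // loop variable is per-thread
        f(src, dest);
    }
    std::printf("dest[0] = %f\n", dest[0]);
    return 0;
}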

Zulan
  • I wonder if this technique is even necessary with `OMP_WAIT_POLICY=active`. – Z boson May 04 '18 at 07:48
  • 1
  • @Zboson `OMP_WAIT_POLICY` is kind of orthogonal and should also influence "manual thread pooling". Since the standard doesn't talk about thread pools, it's certainly up to the implementation whether unused threads (as opposed to threads waiting in a current team) are affected by the wait policy. Anyway - when in doubt, measure :-). In some tests with gcc 7.3.1, the manual approach was slightly faster in some configurations and the simple one in others. – Zulan May 04 '18 at 08:07