Using multiple OMP parallel sections in parallel -> Performance Issue?

Question

I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:

I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:

void algorithm()
{
  #pragma omp parallel for num_threads(12)
  for (int i=0; ...)
  {
     // do some heavy computation (pure memory and CPU work, no I/O, no waiting)
  }

  // ... some more for-loops of this kind
}

The application executes this algorithm n times in parallel from n different threads:

std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);

t1.join();
t2.join();
//...
tn.join();
// end of application
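
For reference, the structure above can be sketched as a small self-contained program. This is only an illustration under assumptions, not the actual application or reproducer: the names `algorithm_sketch` and `run_instances`, the `size` parameter, and the square-root workload are all made up to stand in for the real computation.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for algorithm(): a single OpenMP loop doing only
// CPU and memory work. Returns a checksum so the work cannot be optimized away.
double algorithm_sketch(std::size_t size)
{
    std::vector<double> data(size, 1.0);
    const int count = static_cast<int>(size);

    #pragma omp parallel for num_threads(12)
    for (int i = 0; i < count; ++i)
    {
        double x = data[i];
        for (int k = 0; k < 100; ++k)   // placeholder for the heavy computation
            x = std::sqrt(x + k);
        data[i] = x;
    }

    double sum = 0.0;
    for (double v : data)
        sum += v;
    return sum;
}

// Runs n instances from n std::threads, joins them all at the end,
// and returns the elapsed wall-clock time in seconds.
double run_instances(int n, std::size_t size)
{
    const auto start = std::chrono::steady_clock::now();

    std::vector<double> results(static_cast<std::size_t>(n), 0.0);
    std::vector<std::thread> threads;
    for (int t = 0; t < n; ++t)
        threads.emplace_back([&results, t, size] { results[t] = algorithm_sketch(size); });
    for (auto& th : threads)
        th.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Comparing `run_instances(1, size)` with `run_instances(2, size)` for a suitably large `size` gives the two timings discussed below. The code has to be built with OpenMP enabled (`/openmp` for MSVC, `-fopenmp` for GCC/Clang); otherwise the pragma is ignored and the loop runs serially.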

Now, the problem is as follows:

When I run the application with n=1 (only one call to algorithm()) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5 s and loads the CPU to about 30%, as expected given that I have told OpenMP to use only 12 threads.
When I run with n=2, the CPU load goes up to about 60%, but the application takes almost 10 s. In other words, running two instances in parallel is no faster than running them one after the other.

This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:

If I run my application twice in two parallel processes, each with n=1, both processes complete after about 5 s. So two instances of the algorithm do run well in parallel, as long as they live in different processes.

This seems to rule out many possible causes of the bottleneck. And indeed, I have been unable to find the cause, even after profiling the code. One of my suspicions is that there might be some excessive synchronization inside the OpenMP runtime between parallel regions that are started from different threads.
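
One rough way to probe that suspicion is to time how long it takes to enter and leave many nearly empty parallel regions, first from one caller thread and then from two. The sketch below is an assumption-laden diagnostic, not part of the application: `enter_regions` and `seconds_for` are hypothetical names, and the loop body is deliberately trivial so that almost all of the measured time is runtime overhead.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Enters `regions` tiny parallel regions one after another and returns
// the total number of loop iterations executed (12 per region).
long enter_regions(int regions)
{
    long total = 0;
    for (int r = 0; r < regions; ++r)
    {
        long local = 0;
        #pragma omp parallel for num_threads(12) reduction(+:local)
        for (int i = 0; i < 12; ++i)
            local += 1;               // trivial body: overhead dominates
        total += local;
    }
    return total;
}

// Calls enter_regions(regions) from `callers` std::threads at once and
// returns the elapsed wall-clock time in seconds.
double seconds_for(int callers, int regions)
{
    const auto start = std::chrono::steady_clock::now();

    std::vector<long> counts(static_cast<std::size_t>(callers), 0);
    std::vector<std::thread> threads;
    for (int c = 0; c < callers; ++c)
        threads.emplace_back([&counts, c, regions] { counts[c] = enter_regions(regions); });
    for (auto& th : threads)
        th.join();

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

If `seconds_for(2, regions)` comes out far larger than `seconds_for(1, regions)`, that would point toward overhead in the runtime rather than in the workload itself, although it would not prove it.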

Has anyone seen an effect like this before? Or can anyone give me advice on how to approach this? I have tried everything I can think of, without any success so far, so I appreciate any help I can get!

Thanks a lot,

Da

PS.:

I have been using both MS Visual Studio 2015 and Intel's 2017 compiler; both show basically the same effect.
I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.

I assume (with zero evidence) that the call to join screws things up for you. Does the join come after all work is done? — xyious, Sep 08 '17 at 18:02
Not sure I understand you correctly: the call to `join()` should block until the respective thread finishes the work, right? So yes, the joins are meant to mean "wait until all work is done". — Da Jogh, Sep 08 '17 at 18:07
I'm not sure how to explain my train of thought correctly.... If you join all the threads at some point in the middle and then start more threads, you'll be waiting for the last thread to finish. That slows things down more the more threads you have. — xyious, Sep 08 '17 at 18:18
Ah, now I know what you mean. I am really just joining all threads at the very end, as in the code snippet above. The application immediately exits as soon as the last thread has completed. Does that answer your question? (added comment above to clarify this) — Da Jogh, Sep 08 '17 at 18:21
1) You are operating outside of the realm of specifications. It's not impossible - but it's complicated. See also [this lengthy discussion](https://stackoverflow.com/a/13839719/620382). 2) Have you tried to create a [mcve]? As you noted, there could be many reasons - and it's not good to guess in an answer. — Zulan, Sep 09 '17 at 02:37
