I am trying to understand a huge performance problem with one of our C++ applications using OpenMP (on Windows). The structure of the application is as follows:
I have an algorithm which basically consists of a couple of for-loops which are parallelized using OpenMP:
void algorithm()
{
#pragma omp parallel for numThreads(12)
for (int i=0; ...)
{
// do some heavy computation (pure memory and CPU work, no I/O, no waiting)
}
// ... some more for-loops of this kind
}
The application executes this algorithm n
times in parallel from n
different threads:
std::thread t1(algorithm);
std::thread t2(algorithm);
//...
std::thread tn(algorithm);
t1.join();
t2.join();
//...
tn.join();
// end of application
Now, the problem is as follows:
- when I run the application with
n=1
(only one call toalgorithm()
) on my system with 32 physical CPU cores (no hyperthreading), it takes about 5s and loads the CPU to about 30% as expected (given that I have told OpenMP to only use 12 threads). - when I run with
n=2
, the CPU load goes up to about 60%, but the application takes almost 10 seconds. This means that it is almost impossible to run multiple algorithm instances in parallel.
This alone, of course, can have many reasons (including cache misses, RAM bandwidth limitations, etc.), but there is one thing that strikes me:
- if I run my application twice in two parallel processes, each with
n=1
, both processes complete after about 5 seconds, meaning that I was well able to run two of my algorithms in parallel, as long as they live in different processes.
This seems to exclude many possible reasons for this performance bottleneck. And indeed, I have been unable to understand the cause of this, even after profiling the code. One of my suspicions is that there might be some excessive synchronization in OpenMP between different parallel sections.
Has anyone ever seen an effect like this before? Or can anyone give me advice how to approach this? I have really come to a point where I have tried all I can imagine, but without any success so far. I thus appreciate any help I can get!
Thanks a lot,
Da
PS.:
- I have been using both, MS Visual Studio 2015 and Intel's 2017 compiler - both show basically the same effect.
- I have a very simple reproducer showing this problem which I can provide if needed. It is really not much more than the above, just adding some real work to be done inside the for-loops.