While trying to use OpenMP in a C++ application I ran into severe performance issues: the multi-threaded version could be up to 1000x slower than the single-threaded one. This only happens if at least one core is maxed out by another process.
After some digging I was able to isolate the problem to a small example; I hope someone can shed some light on it!
Minimal example
Here is a minimal example which illustrates the problem:
#include <iostream>

int main() {
    int sum = 0;
    for (size_t i = 0; i < 1000; i++) {
        #pragma omp parallel for reduction(+:sum)
        for (size_t j = 0; j < 100; j++) {
            sum += i;
        }
    }
    std::cout << "Sum was: " << sum << std::endl;
}
I need the OpenMP directive to be inside the outer for-loop since my real code is looping over timesteps which are dependent on one another.
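For context, my real code has roughly the following shape (the state vector and the update formula are just placeholders, not my actual code): each timestep reads the result of the previous one, so only the work inside a timestep can be parallelized.

#include <vector>

int main() {
    std::vector<double> state(100, 1.0);
    for (size_t t = 0; t < 1000; t++) {
        // each timestep depends on the state produced by the previous one,
        // so the outer loop must stay serial
        #pragma omp parallel for
        for (size_t j = 0; j < state.size(); j++) {
            // placeholder update that reads the previous step's value
            state[j] += 0.5 * state[j];
        }
    }
}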
My setup
I ran the example on Ubuntu 21.04 with an AMD Ryzen 9 5900X (12 cores, 24 threads) and compiled it with G++ 10.3.0 using g++ -fopenmp example.cc.
Benchmarking
If you run this program with nothing else in the background it terminates quickly:
> time ./a.out
Sum was: 999000
real 0m0,006s
user 0m0,098s
sys 0m0,000s
But if a single core is used by another process it runs incredibly slowly. In this case I ran stress -c 1 to simulate another process fully using a core in the background.
> time ./a.out
Sum was: 999000
real 0m8,060s
user 3m2,535s
sys 0m0,076s
This is a slowdown of roughly 1300x. My machine supports 24 hardware threads, so with one of them busy and 23 still available the expected slowdown is only about 4% (24/23 ≈ 1.04), not three orders of magnitude.
Findings
The problem seems to be related to how OpenMP allocates/assigns the threads.
- If I move the omp directive to the outer loop, the issue goes away (see the sketch after this list)
- If I explicitly set the thread count to 23 with num_threads(23), the issue goes away
- If I explicitly set the thread count to 24, the issue remains
- How long it takes for the process to terminate varies between 1 and 8 seconds
- The program uses as much of the CPU as possible the whole time it is running; I assume most of the OpenMP threads are busy-waiting in spinlocks
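To make the first two findings concrete, here is roughly what the two working variants looked like on the toy example (only the pragma placement and clause change):

// Variant 1: directive on the outer loop -- no slowdown, but my real
// timesteps depend on each other, so I can't actually use this
#pragma omp parallel for reduction(+:sum)
for (size_t i = 0; i < 1000; i++) {
    for (size_t j = 0; j < 100; j++) {
        sum += i;
    }
}

// Variant 2: directive stays on the inner loop, but with one thread fewer
// than the machine has -- also no slowdown
for (size_t i = 0; i < 1000; i++) {
    #pragma omp parallel for reduction(+:sum) num_threads(23)
    for (size_t j = 0; j < 100; j++) {
        sum += i;
    }
}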
From these findings it would seem that OpenMP assigns work to all cores, including the one that is already maxed out, and then somehow forces each core to finish its chunk without allowing it to be redistributed once the other cores are done.
I have tried changing the scheduling to dynamic but that didn't help either.
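For completeness, this is roughly what the dynamic-scheduling attempt looked like:

for (size_t i = 0; i < 1000; i++) {
    // dynamic scheduling should let faster threads pick up extra chunks,
    // but it made no noticeable difference here
    #pragma omp parallel for reduction(+:sum) schedule(dynamic)
    for (size_t j = 0; j < 100; j++) {
        sum += i;
    }
}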
I would be very grateful for any suggestions; I'm new to OpenMP, so it's possible that I've made a mistake. What do you make of this?