While trying to use OpenMP in a C++ application I ran into severe performance issues: the multi-threaded version could be up to 1000x slower than the single-threaded one. This only happens if at least one core is maxed out by another process.
After some digging I was able to isolate the problem to a small example; I hope someone can shed some light on it!
Minimal example
Here is a minimal example which illustrates the problem:
#include <iostream>

int main() {
    int sum = 0;
    for (size_t i = 0; i < 1000; i++) {
        #pragma omp parallel for reduction(+:sum)
        for (size_t j = 0; j < 100; j++) {
            sum += i;
        }
    }
    std::cout << "Sum was: " << sum << std::endl;
}
I need the OpenMP directive to be inside the outer for-loop since my real code is looping over timesteps which are dependent on one another.
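For context, my real code has roughly the following shape (the state vector and the update formula are just placeholders, not my actual code): each timestep reads the result of the previous one, so only the work inside a timestep can be parallelized.

#include <vector>

int main() {
    std::vector<double> state(100, 1.0);
    for (size_t t = 0; t < 1000; t++) {
        // each timestep depends on the state produced by the previous one,
        // so the outer loop must stay serial
        #pragma omp parallel for
        for (size_t j = 0; j < state.size(); j++) {
            // placeholder update that reads the previous step's value
            state[j] += 0.5 * state[j];
        }
    }
}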
My setup
I ran the example on Ubuntu 21.04 with an AMD Ryzen 9 5900X (12 cores, 24 threads) and compiled it with G++ 10.3.0 using g++ -fopenmp example.cc.
Benchmarking
If you run this program with nothing else in the background it terminates quickly:
> time ./a.out
Sum was: 999000
real 0m0,006s
user 0m0,098s
sys 0m0,000s
But if a single core is used by another process it runs incredibly slowly. In this case I ran stress -c 1 to simulate another process fully using a core in the background.
> time ./a.out
Sum was: 999000
real 0m8,060s
user 3m2,535s
sys 0m0,076s
This is a slowdown of roughly 1300x. My machine supports 24 hardware threads, so with one of them busy and 23 still available the expected slowdown is only about 4% (24/23 ≈ 1.04), not three orders of magnitude.
Findings
The problem seems to be related to how OpenMP allocates/assigns the threads.
- If I move the omp directive to the outer loop, the issue goes away (see the sketch after this list)
- If I explicitly set the thread count to 23 with num_threads(23), the issue goes away
- If I explicitly set the thread count to 24, the issue remains
- How long it takes for the process to terminate varies between 1 and 8 seconds
- The program uses as much of the CPU as possible the whole time it is running; I assume most of the OpenMP threads are busy-waiting in spinlocks
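To make the first two findings concrete, here is roughly what the two working variants looked like on the toy example (only the pragma placement and clause change):

// Variant 1: directive on the outer loop -- no slowdown, but my real
// timesteps depend on each other, so I can't actually use this
#pragma omp parallel for reduction(+:sum)
for (size_t i = 0; i < 1000; i++) {
    for (size_t j = 0; j < 100; j++) {
        sum += i;
    }
}

// Variant 2: directive stays on the inner loop, but with one thread fewer
// than the machine has -- also no slowdown
for (size_t i = 0; i < 1000; i++) {
    #pragma omp parallel for reduction(+:sum) num_threads(23)
    for (size_t j = 0; j < 100; j++) {
        sum += i;
    }
}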
From these findings it would seem that OpenMP assigns work to all cores, including the one that is already maxed out, and then somehow forces each core to finish its chunk without allowing it to be redistributed once the other cores are done.
I have tried changing the scheduling to dynamic but that didn't help either.
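For completeness, this is roughly what the dynamic-scheduling attempt looked like:

for (size_t i = 0; i < 1000; i++) {
    // dynamic scheduling should let faster threads pick up extra chunks,
    // but it made no noticeable difference here
    #pragma omp parallel for reduction(+:sum) schedule(dynamic)
    for (size_t j = 0; j < 100; j++) {
        sum += i;
    }
}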
I would be very grateful for any suggestions; I'm new to OpenMP, so it's possible that I've made a mistake. What do you make of this?