
I'm trying to parallelise a for loop, and I'm encountering undesired behaviour. The loop calls a taxing function (which contains another for loop) and then prints the result. I've parallelised the loop using `#pragma omp parallel for`.
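
For reference, here is a minimal sketch of the kind of loop in question (not the actual code; `taxing_function` is a hypothetical stand-in whose cost varies per iteration):

```c
#include <stdio.h>
#include <omp.h>

/* Hypothetical stand-in for the expensive function: an inner
 * loop whose cost grows with the iteration index, so later
 * iterations take longer than earlier ones. */
static double taxing_function(int i)
{
    double acc = 0.0;
    for (long j = 0; j < (long)(i + 1) * 100000L; ++j)
        acc += 1.0 / (double)(j + 1);
    return acc;
}

int main(void)
{
    const int n = 64;

    /* Without a schedule clause, most implementations use a
     * static schedule: iterations are divided among the threads
     * up front, before any work is done. */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        printf("%d: %f\n", i, taxing_function(i));

    return 0;
}
```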

The behaviour I'm seeing is: the CPU is fully utilised at the start, and then near the end, utilisation suddenly drops to 25%. My guess is that the iterations are assigned to threads up front, and once most of them are complete, the idle threads just wait for the remaining ones to finish. If that were the case, though, I would expect utilisation to step down through 75% and 50% before reaching 25%, but no, it drops straight to 25%.

I've tried parallelising the function itself, but it made no difference. Removing the parallelisation from the outer loop produced behaviour where usage would spike to 100%, drop to 25%, and then repeat that pattern throughout execution, which gave even worse performance than before. I also tried a number of other clauses on the for loop, such as schedule.

How can I assign the idle threads to the remaining tasks? Or is something like this not possible in OpenMP?

1 Answer


If your guess is correct, then you should apply `schedule(dynamic)` to your loop, which has the following effect:

> When kind is dynamic, the iterations are distributed to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the chunk that contains the sequentially last iteration, which may have fewer iterations. When no chunk_size is specified, it defaults to 1.
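
Applied to the loop sketched above, the clause would look like this (same hypothetical names as before):

```c
/* With dynamic scheduling, each thread takes one iteration at a
 * time (the default chunk_size of 1) and requests the next as
 * soon as it finishes, so a few slow iterations no longer leave
 * the other threads idle at the end of the loop. */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < n; ++i)
    printf("%d: %f\n", i, taxing_function(i));
```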

You can also experiment with increasing the `chunk_size` (e.g., `schedule(dynamic,16)`) or using `schedule(guided)`:

> When kind is guided, the iterations are assigned to threads in the team in chunks. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. [...]
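
Again building on the hypothetical sketch above, the two variants look like this:

```c
/* dynamic with a larger chunk: threads grab 16 iterations at a
 * time, which reduces scheduling overhead but coarsens the load
 * balancing near the end of the loop. */
#pragma omp parallel for schedule(dynamic, 16)
for (int i = 0; i < n; ++i)
    printf("%d: %f\n", i, taxing_function(i));

/* guided: chunks start large and shrink towards 1 as the
 * remaining iterations are handed out, a compromise between
 * low scheduling overhead and good load balancing. */
#pragma omp parallel for schedule(guided)
for (int i = 0; i < n; ++i)
    printf("%d: %f\n", i, taxing_function(i));
```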

Take a look at this answer for a detailed discussion about dynamic vs guided schedules.

In general, I recommend never guessing about performance. Use a sophisticated performance analysis tool that understands OpenMP and can tell you about the actual potential for optimization in your code.

Zulan
  • Adding `schedule(dynamic)` did the trick. I noticed more natural behaviour in the drops, in the sense that there was a lot of variation in them: the worst I saw was a drop to 25% utilisation for 1.25 seconds, the best was a quick dip to 75%, and there was a range of behaviours in between (75% then 50%, or just 50%, for example). I experimented with the chunk size, but increasing it only made performance worse. Thanks for your help. Also, which tools can I use to analyse this? I've already tried Visual Studio's profiler, but it didn't show me much about OpenMP. –  Sep 11 '19 at 10:52
  • I use [Score-P](http://score-p.org/) and [Vampir](https://vampir.eu/) - see also [this related answer](https://stackoverflow.com/a/43047074/620382) - which I totally forgot about when writing this answer. But there are other tools which get you the same information. – Zulan Sep 11 '19 at 12:46