I'm trying to parallelise a for loop, and I'm encountering undesired behaviour. The loop calls a taxing function (which contains another for loop) and then prints the result. I've parallelised the loop using #pragma omp parallel for.
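For context, the structure is roughly like this (simplified sketch only; the function name, iteration count, and inner-loop body are placeholders, not my actual code):

```cpp
#include <cstdio>

// Stand-in for the taxing function; the heavy work is its own inner loop.
double taxing_function(int i)
{
    double acc = 0.0;
    for (int j = 0; j < 100000000; ++j)   // placeholder inner loop
        acc += (i + 1) * 1e-9 * j;
    return acc;
}

int main()
{
    const int n = 32;                      // placeholder number of outer iterations

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double result = taxing_function(i);
        std::printf("iteration %d -> %f\n", i, result);
    }
    return 0;
}
```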
The behaviour I'm seeing is: the CPU is fully utilised at the start, and then near the end it suddenly drops to 25% utilisation. My guess is that each iteration gets assigned to a thread, and once most of them have finished, the remaining threads sit idle while the last iterations complete. Though if that were the case, I would expect utilisation to step down through 75% and 50% before reaching 25%, but no, it drops straight to 25%.
I've tried parallelising the function itself (its inner loop), but it made no difference. Removing the parallelisation from the outer loop and keeping only the inner one resulted in usage spiking to 100%, dropping back to 25%, and repeating that pattern throughout execution, which performed even worse than before. I also tried a number of other clauses on the for loop, such as different schedule options.
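Roughly, the variants I tried looked like the following (again simplified; the reduction clause and chunk size are only illustrative of the pattern, not my exact code):

```cpp
#include <cstdio>

// Variant 1: parallelising the inner loop of the taxing function itself
// (illustrative only -- the real inner loop differs).
double taxing_function(int i)
{
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (int j = 0; j < 100000000; ++j)
        acc += (i + 1) * 1e-9 * j;
    return acc;
}

int main()
{
    const int n = 32;  // placeholder iteration count

    // Variant 2: different schedule clauses on the outer loop,
    // e.g. dynamic scheduling with chunk size 1.
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; ++i)
        std::printf("iteration %d -> %f\n", i, taxing_function(i));

    return 0;
}
```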
How would I get the otherwise idle threads to help with the last remaining tasks? Or is something like this not possible in OpenMP?