OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

Question

I have some C++ code that I am running for an optimisation task, and I am trying to parallelise it using OpenMP. I tried using #pragma omp parallel for on both loops, but realised pretty quickly that it didnt work, so I want to set up a conditional to decide whether to parallelise the outer or inner loop, depending on how many outer iterations there are.

Here is the code:

std::vector<Sol> seeds; // vector with initial solutions
std::vector<Sol> sols (N_OUTER*N_INNER); // vector for output solutions
int N_OUTER; // typically 1-8
int N_INNER;  // typically > 100
int PAR_THRESH; // this is the parameter I am interested in setting

#pragma omp parallel for if (N_OUTER >= PAR_THRESH)
for (int outer = 0; outer < N_OUTER; ++outer){
    #pragma omp parallel for if (N_OUTER < PAR_THRESH)
    for (int inner = 0; inner < N_INNER; ++inner){
        sols[outer*N_INNER + inner] = solve(seeds[outer]);
    }
}

This works fine to decide which loop (inner or outer) gets parallelised; however, I am trying to determine what is the best value for PAR_THRESH.

My intuition says that if N_OUTER is 1, then it shouldn't parallelise the outer loop, and if N_OUTER is greater than the number of threads available, then the outer loop should be the one to be parallelised; because it uses maximum available threads and the threads are long as possible. My question is about when N_OUTER is either 2 or 3 (4 being the number of threads available).

Is it better to run, say, 2 or 3 threads that are long, in parallel; but not use up all of the available threads? Or is it better to run the 2 or 3 outer loops in serial, while utilising the maximum number of threads for the inner loop?

Or is there a kind of trade off in play, and maybe 2 outer loop iterations might be wasting threads, but if there are 3 outer loop iterations, then having longer threads is beneficial, despite the fact that one thread is remaining unused?

EDIT:

edited code to replace N_ITER with N_INNER in two places

Try it, measure it, and see. As it is there is far too much variability in what runs and how it runs for us to say. — 1201ProgramAlarm, Jan 13 '18 at 01:43
Thanks, I will just try both and see which is better for this application. My question was more about whether there was a rule-of-thumb (or best practice) with these things in general. I saw on another question that someone mentioned it is always better to have longer threads than shorter ones, however I was just wondering if this was the case even when you weren't utilising all available threads. — guskenny83, Jan 13 '18 at 01:50

score 1 · Accepted Answer · answered Jan 13 '18 at 02:56

Didn't have much experience with OpenMP, but I have found something like collapse directive:

https://software.intel.com/en-us/articles/openmp-loop-collapse-directive

Understanding the collapse clause in openmp

It seems to be even more appropriate when number of inner loop iterations differs.

--

On the other hand:

It seems to me that solve(...) is side-effect free. It seems also that N_ITER is N_INNER.

Currently you calculate solve N_INNER*N_OUTER times. While reducing that won't reduce O notation complexity, assuming it has very large constant factor - it should save a lot of time. You cannot cache the result with collapse, so maybe this could be even better:

std::vector<Sol> sols_tmp (N_INNER);
#pragma omp parallel for
for (int i = 0; i < N_OUTER; ++i) { 
    sols_tmp[i] = solve(seeds[i]);
}

This calculates only N_OUTER times.

Because solve returns same value for each row:

#pragma omp parallel for
for (int i = 0; i < N_OUTER*N_INNER; ++i) {
    sols[i] = sols_tmp[i/N_INNER];
}

Of course it must be measured if parallelization is suitable for those loops.

You are correct that `N_ITER = N_INNER`, that was a typo sorry. In terms of the rest of your answer, thanks, i didn't know about the `collapse` clause, but it looks like it might be a good solution. With regard to `solve()`, it is a stochastic optimisation algorithm that takes `seeds[i]` as an initial solution, but does not necessarily return the same value for the same seed. This is why I need the two loops, one to iterate through the seed solutions, and the inner loop to generate `N_INNER` solutions, based on those seeds. But thanks for the tip about `collapse`, i will look into it! — guskenny83, Jan 14 '18 at 03:32

OpenMP: Is it better to use fewer threads that are long, or maximum available threads that are short?

1 Answers1