Q :
" ... how can I further improve the efficiency of this case? "
A :
follow the "economy-of-costs" : some costs are visible, some less so, yet all of them decide the resulting ( im )performance & ( in )efficiency.
How much did we already have to pay, before any PAR-section even started?
Let's start with the costs of the proposed addition of multiple streams of code-execution :

Using the number of ASM-instructions as a simplified measure of how much work has to be done ( where all of the CPU-clocks + RAM-allocation costs + RAM-I/O + O/S system-management time spent count ), we start to see the relative weight of all these ( unavoidable ) add-on costs, compared to the actual useful task ( i.e. how many ASM-instructions are finally spent on what we want to compute, contrasted with the amount of overhead costs already burnt, that were needed to make this happen ).
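These add-on costs need not stay abstract, they can be measured right on the target platform. A minimal sketch ( the workload and the timing harness are illustrative assumptions, not the code from the question ), comparing the cost of spawning one process against the cost of the "useful" work it is supposed to carry :

```python
# A minimal sketch ( the workload is an illustrative stand-in, not the code
# from the question ): compare the add-on cost of spawning one process
# against the cost of the "useful" work it is supposed to carry.
from multiprocessing import Process
import time

def useful_work():                                   # stand-in payload
    s = 0
    for i in range( 10**6 ):
        s += i * i
    return s

if __name__ == "__main__":
    t0 = time.perf_counter()
    useful_work()                                    # pure, in-process baseline
    t_work = time.perf_counter() - t0

    t0 = time.perf_counter()
    p = Process( target = useful_work )
    p.start()                                        # pay the instantiation costs
    p.join()                                         # plus the O/S-scheduling costs
    t_spawned = time.perf_counter() - t0

    overhead = t_spawned - t_work
    print( f"useful work alone     : {t_work    * 1e3:8.2f} ms" )
    print( f"spawn + work + join   : {t_spawned * 1e3:8.2f} ms" )
    print( f"add-on overhead burnt : {overhead  * 1e3:8.2f} ms"
           f" ( ~{100 * overhead / t_work:5.1f} % of the useful work )" )
```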
This fraction is cardinal when fighting for both performance & efficiency ( of resource-usage patterns ).
Cases, where the add-on overhead costs dominate the useful work, are a straight sin of performance anti-patterns.
Cases, where the add-on overhead costs amount to less than 0.01 % of the useful work, may still end up with unsatisfactorily low speed-ups ( see the simulator and the related details ).
Cases, where the scope of the useful work dwarfs all add-on overhead costs, still meet the Amdahl's Law ceiling - so self-explanatory that it is also called a "Law of diminishing returns" ( since adding more resources ceases to improve the performance, even if we add infinitely many CPU-cores and the like ).
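For orientation, a small sketch of such an economy-of-costs, assuming an overhead-strict re-formulation of Amdahl's Law ( an assumed model, the simulator's exact formula may differ ) :

```python
# An assumed overhead-strict re-formulation of Amdahl's Law ( a sketch,
# not necessarily the simulator's exact formula ):
#   p ... fraction of the work that can run in [PARALLEL]    ( 0.0 .. 1.0 )
#   N ... number of CPU-cores / spawned processes
#   o ... add-on cost per spawned process, as a fraction of the
#         [PARALLEL]-part's useful work                      ( 0.01 == 1 % )
def speedup( p, N, o ):
    return 1.0 / ( ( 1.0 - p )       # the pure-[SERIAL] part never shrinks
                 + p / N             # the best-case [PARALLEL] split
                 + N * o * p )       # the add-on costs, paid once per process

for N in ( 1, 2, 4, 8, 16, 32, 64, 128, 256 ):
    print( f"p = 0.95, o = 0.01, N = {N:3d} -> Speedup ~ {speedup( 0.95, N, 0.01 ):6.2f} x" )
```

Note how, under this assumed cost model, the speed-up first peaks and then degrades, once the N-times paid spawning costs outgrow the benefits of the ever finer work-split.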

Tips for experimenting with the Costs-of-instantiation(s) :
We start with assuming the code is 100 % parallelisable ( having no pure-[SERIAL] part, which is never the case in reality, yet let's use it ).
- first, move the NCPUcores-slider in the simulator to anything above 64-cores
- next, move the Overhead-slider in the simulator to anything above a plain zero ( it expresses the relative add-on cost of spawning one of the NCPUcores processes, as a number of percent, relative to the [PARALLEL]-section part's number of instructions ). Mathematically "dense" work has many such "useful" instructions and may, supposing no other performance killers jump out of the box, afford to spend some reasonable amount of "add-on" costs to spawn some amount of concurrent- or parallel-operations ( the actual number depends only on the actual economy of costs, not on how many CPU-cores are present, and even less on our "wishes" or on scholastic or, even worse, copy/paste-"advice" ). On the contrary, mathematically "shallow" work almost always ends up with "speed-ups" << 1 ( i.e. immense slow-downs ), as there is almost no chance to justify the known add-on costs ( paid on thread/process-instantiations and on data SER/xfer/DES when moving params-in and results-back, the worse if among processes; a sketch measuring these SER/DES costs alone follows right after this list )
- next, move the Overhead-slider in the simulator to the rightmost edge == 1. This shows the case when the actual thread/process-spawning overhead costs ( a time lost ) are still not more than just <= 1 % of all the computing-related instructions that are next going to be performed for the "useful" part of the work, to be computed inside each such spawned process-instance. So even such a 1:100 proportion factor ( doing 100x more "useful" work than the CPU-time lost on arranging that many copies and on making the O/S-scheduler orchestrate their concurrent execution inside the available system Virtual-Memory ) already shows all the warnings visible in the graph of the progression of the Speed-up degradation - just play a bit with the Overhead-slider in the simulator, before touching the others...
- only now touch and move the p-slider in the simulator to anything less than 100 % ( so far we assumed no pure-[SERIAL] part of the problem execution, which was nice in theory, yet is never doable in practice, as even the program launch is a pure-[SERIAL] section, by design )
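As promised above, a minimal sketch of measuring the SER/DES costs alone ( numpy and the array size are illustrative assumptions ) :

```python
# A minimal sketch: how much CPU-time the SER/xfer/DES part alone burns,
# before any "useful" work even starts ( numpy and the ~80 MB array size
# are illustrative assumptions ).
import pickle, time
import numpy as np

params_in = np.random.rand( 10**7 )                  # ~80 MB of params to move

t0   = time.perf_counter()
blob = pickle.dumps( params_in, protocol = pickle.HIGHEST_PROTOCOL )   # SER
t1   = time.perf_counter()
back = pickle.loads( blob )                                            # DES
t2   = time.perf_counter()

print( f"SER : {( t1 - t0 ) * 1e3:8.2f} ms for {len( blob ) / 1e6:6.1f} MB" )
print( f"DES : {( t2 - t1 ) * 1e3:8.2f} ms" )
# the actual xfer among processes adds pipe / socket or shared-memory costs on top
```

The same price is paid again for moving the results back, and it multiplies with the number of processes fed this way.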
So,
besides straight errors,
besides performance anti-patterns,
there is a lot of technical reasoning behind ILP, SIMD-vectorisation and cache-line respecting tricks, which first squeeze out the maximum possible single-core performance the task can ever get ( a minimal vectorisation sketch follows below this list )
- refactoring of the real problem shall never go against the collected knowledge about performance, as repeating the things that do not work will not bring any advantage, will it?
- respect your physical platform's constraints, as ignoring them will degrade your performance
- benchmark, profile, refactor
- benchmark, profile, refactor
- benchmark, profile, refactor
no other magic wand available here.
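A minimal single-core sketch of the vectorisation point above ( numpy and the vector size are illustrative assumptions ) : the very same "useful" work, once as an interpreted loop, once as a SIMD-friendly, cache-line respecting vectorised call :

```python
# A minimal sketch ( numpy and the vector size are illustrative assumptions ):
# the same "useful" work, interpreted loop vs. SIMD-friendly vectorised call.
import time
import numpy as np

x = np.random.rand( 10**7 )

t0 = time.perf_counter()
s_loop = 0.0
for v in x:                                 # interpreted, item-by-item work
    s_loop += v * v
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
s_vec = float( np.dot( x, x ) )             # vectorised, cache-line friendly sweep
t_vec = time.perf_counter() - t0

print( f"loop       : {t_loop * 1e3:9.2f} ms" )
print( f"vectorised : {t_vec  * 1e3:9.2f} ms ( ~{t_loop / t_vec:5.0f} x faster )" )
```

Only once this single-core ceiling has been reached does it make sense to start paying for spawning any additional streams of code-execution.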
Details matter, always. The CPU / NUMA architecture details matter, and a lot. Check all the possibilities of the actual native architecture, as without respecting these details, the runtime performance will not reach the capabilities that are technically available.
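A minimal sketch of such a check, right from the code ( os.sched_getaffinity() is Linux-only; lscpu / numactl are standard Linux utilities and may not be installed everywhere ) :

```python
# A minimal sketch: inspect what the platform actually offers, before deciding
# how many processes are worth spawning ( Linux-only calls / tools are noted ).
import os, multiprocessing, subprocess

print( "CPU-cores reported     :", multiprocessing.cpu_count() )
print( "CPU-cores usable by us :", len( os.sched_getaffinity( 0 ) ) )   # Linux-only

for tool in ( ( "lscpu", ), ( "numactl", "--hardware" ) ):              # NUMA details
    try:
        print( subprocess.run( tool, capture_output = True, text = True ).stdout )
    except FileNotFoundError:
        pass                                   # the tool is not installed here
```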