
I have this simple, self-contained example of a very rudimentary stencil application to experiment with OpenMP tasks and the depend clause. In each of two update steps, every location of an array accumulates three values from another array: the corresponding location and its left and right neighbours. To avoid data races I have set up dependencies so that, for every section in the second update step, its task can only be scheduled once the tasks for the relevant sections from the first update step have executed. I get the expected results, but I am not sure whether my assumptions are correct, since these tasks might be executed immediately by the encountering threads rather than deferred. So my question is: are the tasks created in worksharing loops all sibling tasks, and are the dependencies therefore retained just as when the tasks are generated inside a single construct?


#include <iostream>
#include <omp.h>
#include <math.h>
#include <stdlib.h> // malloc/free

typedef double value_type;
int main(int argc, char * argv[]){

    std::size_t size = 100000;
    std::size_t part_size = 25;

    std::size_t parts = ceil(float(size)/part_size);
    std::size_t num_threads = 4;

    value_type * A = (value_type *) malloc(sizeof(value_type)*size);
    value_type * B = (value_type *) malloc(sizeof(value_type)*size);
    value_type * C = (value_type *) malloc(sizeof(value_type)*size);


    for (std::size_t i = 0; i < size; ++i) {
        A[i] = 1;
        B[i] = 1;
        C[i] = 0;
    }


#pragma omp parallel num_threads(num_threads)
{

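    // step 1: one task per part computes B[i] += A[i] + A[i-1] + A[i+1];
    // the first element of each part serves as the dependency handle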
    #pragma omp for schedule(static)
        for(std::size_t part=0; part<parts; part++){
            std::size_t current_part = part * part_size;
            std::size_t left_part = part != 0 ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = current_part;
            std::size_t end = part == parts-1 ? size-1 : start+part_size;
            if(part==0) start = 1;

            #pragma omp task depend(in: A[current_part], A[left_part], A[right_part]) depend(out: B[current_part])
            {
                for(std::size_t i=start; i<end; i++){
                    B[i] += A[i] + A[i-1] + A[i+1];
                }
            }
        }
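    // step 2: same partitioning; each task reads the three B parts written
    // by the corresponding step-1 tasks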
    #pragma omp for schedule(static)
        for(std::size_t part=0; part<parts; part++){
            std::size_t current_part = part * part_size;
            std::size_t left_part = part != 0 ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = current_part;
            std::size_t end = part == parts-1 ? size-1 : start+part_size;
            if(part==0) start = 1;
            #pragma omp task depend(in: B[current_part], B[left_part], B[right_part]) depend(out: C[current_part])
            {
                for(std::size_t i=start; i<end; i++){
                    C[i] += B[i] + B[i-1] + B[i+1];
                }
            }
        }
}



    value_type sum = 0;
    value_type max = -1000000000000;
    value_type min =  1000000000000;
    for(std::size_t i = 0; i < size; i++){
        sum+=C[i];
        if(C[i]<min) min = C[i];
        if(C[i]>max) max = C[i];
    }
    std::cout << "sum: " << sum << std::endl;
    std::cout << "min: " << min << std::endl;
    std::cout << "max: " << max << std::endl;
    std::cout << "avg: " << sum/(size) << std::endl;
    free(A);
    free(B);
    free(C);
    return 0;
}



user151387

1 Answer


In the OpenMP specification you can find the corresponding definitions:

sibling tasks - Tasks that are child tasks of the same task region.

child task - A task is a child task of its generating task region. A child task region is not part of its generating task region.

task region - A region consisting of all code encountered during the execution of a task. COMMENT: A parallel region consists of one or more implicit task regions

In the description of the parallel construct you can read that:

A set of implicit tasks, equal in number to the number of threads in the team, is generated by the encountering thread. The structured block of the parallel construct determines the code that will be executed in each implicit task.

In practice this means that a parallel region consists of several implicit task regions, one per thread of the team, and in your code the worksharing loop (#pragma omp for) distributes the iterations so that each thread's implicit task region generates some of the explicit tasks (#pragma omp task ...). However, only tasks generated by the same task region are sibling tasks, and the depend clause orders a task only with respect to its siblings, so the tasks created in your two worksharing loops are not all siblings. If you want to be sure that all generated tasks are sibling tasks, you have to generate all the explicit tasks from a single task region (e.g. using the single construct).
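For example, here is a minimal sketch (reusing size, parts, part_size, num_threads and the arrays A, B, C from the question) in which one thread generates all the explicit tasks, so that they are all siblings:

#pragma omp parallel num_threads(num_threads)
{
    // only one thread executes this block, so every task below is generated
    // by the same task region and they are all sibling tasks
    #pragma omp single
    {
        for (std::size_t part = 0; part < parts; part++) {
            std::size_t current_part = part * part_size;
            std::size_t left_part  = part != 0       ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = part == 0       ? 1        : current_part;
            std::size_t end   = part == parts-1 ? size - 1 : current_part + part_size;

            #pragma omp task depend(in: A[current_part], A[left_part], A[right_part]) \
                             depend(out: B[current_part])
            for (std::size_t i = start; i < end; i++)
                B[i] += A[i] + A[i-1] + A[i+1];
        }

        for (std::size_t part = 0; part < parts; part++) {
            std::size_t current_part = part * part_size;
            std::size_t left_part  = part != 0       ? (part-1)*part_size : current_part;
            std::size_t right_part = part != parts-1 ? (part+1)*part_size : current_part;
            std::size_t start = part == 0       ? 1        : current_part;
            std::size_t end   = part == parts-1 ? size - 1 : current_part + part_size;

            // these in dependences now refer to sibling tasks, so each step-2
            // task waits for the three step-1 tasks that write its input
            #pragma omp task depend(in: B[current_part], B[left_part], B[right_part]) \
                             depend(out: C[current_part])
            for (std::size_t i = start; i < end; i++)
                C[i] += B[i] + B[i-1] + B[i+1];
        }
    } // implicit barrier at the end of single completes all tasks
}

With this structure the depend clauses alone enforce the ordering between the two steps; no barrier between the two task-generating loops is needed.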

Note that your code nevertheless gives the correct result, because there is an implicit barrier at the end of the worksharing-loop construct (#pragma omp for), and all explicit tasks bound to the parallel region must complete before that barrier is released; the second loop's tasks therefore cannot run before every first-step task has finished. To remove this barrier you can use the nowait clause, and you will see that the result becomes incorrect in that case.
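For instance, it is enough to change the first loop's directive (shown here only as an experiment, not as a fix):

    #pragma omp for schedule(static) nowait

With nowait the threads move on to the second loop while step-1 tasks may still be pending, and since depend clauses do not connect tasks generated by different task regions, nothing else orders a step-2 task after the step-1 tasks whose output it reads.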

Another comment is that in your case the work per task is small compared to the task-creation and synchronization overheads, so my guess is that this parallel code will be slower than a serial version.

Laci
  • Thank you! For my actual program the parallel approach led to an improvement, but I was not sure whether the tasks created in the worksharing loop belong to different task regions. – user151387 Feb 26 '22 at 23:02
  • I am working on a NUMA system. If I wanted to spawn one half of the tasks on one node and the other half on the second one to facilitate data locality, is there a way to achieve that while preserving these dependencies? I used the `single` construct before with affinity clauses on a clang-compiled program, but this is too weak. Checking with `sched.h` I could not observe that the threads were working on the "right" tasks. – user151387 Feb 26 '22 at 23:16
  • @user151387 Maybe [this answer](https://stackoverflow.com/a/11975593/10107454) helps. – paleonix Feb 27 '22 at 01:43
  • Set the `OMP_PLACES` and `OMP_PROC_BIND` environment variables (or use the `proc_bind` clause) and you can also exploit the "first touch" policy (see the sketch after these comments). Have you tried these? – Laci Feb 27 '22 at 05:37
  • Use the environment variable `OMP_DISPLAY_ENV` to verify the affinity settings. – Laci Feb 27 '22 at 05:43
  • I am already using both environment variables, which is why the data is distributed half/half over the 2 nodes through a worksharing loop with static scheduling in the data-writing step. The issue is in the computation step: when the tasks are spawned inside the single construct through a loop similar to this example, all threads from both nodes are already working on tasks for the first half of the array before the iterations for the tasks that should be scheduled on the other node are even reached, i.e. half of the threads are working on non-local memory. – user151387 Feb 27 '22 at 07:26
  • Maybe the best idea is to post a new question about this issue, showing all the details (your computer architecture, how `OMP_PLACES` and `OMP_PROC_BIND` were exactly set, how the data was first initialized (i.e. first touch), later modified, etc.). – Laci Feb 27 '22 at 07:42
  • I have now posted my question [here](https://stackoverflow.com/questions/71284069/how-can-i-realize-data-local-spawning-or-scheduling-of-tasks-in-openmp-on-numa-c). I would be grateful if you could take a look. – user151387 Feb 27 '22 at 10:44
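A minimal shell sketch of the affinity settings mentioned in the comments above (the place and bind values, and the binary name `stencil`, are only examples and have to be adapted to the actual machine):

    # one place per socket/NUMA domain, with the threads spread across them
    export OMP_PLACES=sockets
    export OMP_PROC_BIND=spread
    # print the effective OpenMP settings at startup for verification
    export OMP_DISPLAY_ENV=true
    ./stencil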