
Considering:

    void saxpy_worksharing(float* x, float* y, float a, int N) {
      #pragma omp parallel for
      for (int i = 0; i < N; i++) {
         y[i] = y[i]+a*x[i];
      }
    }

And

    void saxpy_tasks(float* x, float* y, float a, int N) {
      #pragma omp parallel
      {
         for (int i = 0; i < N; i++) {
            #pragma omp task
            {
               y[i] = y[i]+a*x[i];
            }
         }
      }

What is the difference between using tasks and the omp parallel for worksharing directive? Why can we write recursive algorithms such as merge sort with tasks, but not with worksharing?

user1511956

1 Answer


I would suggest that you have a look at the OpenMP tutorial from Lawrence Livermore National Laboratory, available here.

Your particular example is one that should not be implemented using OpenMP tasks. The second code creates N times the number of threads tasks (because there is an error in the code besides the missing `}`; I will come back to it later), and each task only performs a very simple computation. The overhead of the tasks would be gigantic, as you can see in my answer to this question. Besides, the second code is conceptually wrong: since there is no worksharing directive, all threads execute all iterations of the loop, and therefore N times the number of threads tasks get created instead of N tasks. It should be rewritten in one of the following ways:

Single task producer - common pattern, NUMA unfriendly:

void saxpy_tasks(float* x, float* y, float a, int N) {
   #pragma omp parallel
   {
      #pragma omp single
      {
         for (int i = 0; i < N; i++)
            #pragma omp task
            {
               y[i] = y[i]+a*x[i];
            }
      }
   }
}

The single directive makes the loop run inside a single thread only. All other threads skip it and hit the implicit barrier at the end of the single construct. Since barriers contain implicit task scheduling points, the waiting threads start processing tasks immediately as they become available.
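As discussed in the comments below, a nowait clause can also be added to the single construct so that the remaining threads are free to do other work while the producer is still creating tasks. A minimal sketch of that variant (only the single line changes; task completion is then guaranteed by the barrier at the end of the parallel region):

void saxpy_tasks(float* x, float* y, float a, int N) {
   #pragma omp parallel
   {
      #pragma omp single nowait   // no implicit barrier at the end of single
      {
         for (int i = 0; i < N; i++)
            #pragma omp task
            {
               y[i] = y[i]+a*x[i];
            }
      }
      // all tasks are guaranteed to have completed at the implicit
      // barrier at the end of the parallel region
   }
}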

Parallel task producer - more NUMA friendly:

void saxpy_tasks(float* x, float* y, float a, int N) {
   #pragma omp parallel
   {
      #pragma omp for
      for (int i = 0; i < N; i++)
         #pragma omp task
         {
            y[i] = y[i]+a*x[i];
         }
   }
}

In this case the task creation loop would be shared among the threads, so each thread creates tasks only for its own chunk of iterations.
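For completeness, here is a minimal hypothetical driver for either version above (the array size, the initial values, and the sanity check are arbitrary choices for illustration):

#include <stdio.h>
#include <stdlib.h>

/* assumes one of the saxpy_tasks definitions above is in scope */
void saxpy_tasks(float* x, float* y, float a, int N);

int main(void) {
   const int N = 1 << 20;
   float* x = malloc(N * sizeof(float));
   float* y = malloc(N * sizeof(float));
   for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

   saxpy_tasks(x, y, 3.0f, N);   /* expect y[i] == 2 + 3*1 == 5 everywhere */

   for (int i = 0; i < N; i++)
      if (y[i] != 5.0f) { printf("mismatch at %d\n", i); return 1; }
   printf("OK\n");
   free(x); free(y);
   return 0;
}

Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp.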

If you do not know what NUMA is, ignore the comments about NUMA friendliness.
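As for the second part of the question: tasks can be created from inside other tasks and are picked up by whichever thread is free, which is exactly what a recursive algorithm needs. Worksharing constructs, in contrast, must be encountered by all threads of the team and cannot be nested freely inside a recursion. A minimal sketch of a task-based merge sort (the cutoff value and the helper names are illustrative, not a definitive implementation):

#include <stdlib.h>
#include <string.h>

/* merge the sorted ranges a[lo..mid) and a[mid..hi) via scratch buffer tmp */
static void merge(float* a, float* tmp, int lo, int mid, int hi) {
   int i = lo, j = mid, k = lo;
   while (i < mid && j < hi)
      tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
   while (i < mid) tmp[k++] = a[i++];
   while (j < hi)  tmp[k++] = a[j++];
   memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(float));
}

static void msort(float* a, float* tmp, int lo, int hi) {
   if (hi - lo < 2) return;              // ranges of 0 or 1 are sorted
   int mid = lo + (hi - lo) / 2;
   if (hi - lo > 1000) {                 // spawn tasks only for large ranges
      #pragma omp task shared(a, tmp)    // sort the left half in a task
      msort(a, tmp, lo, mid);
      msort(a, tmp, mid, hi);            // sort the right half ourselves
      #pragma omp taskwait               // wait for the left half to finish
   } else {                              // small ranges: plain recursion,
      msort(a, tmp, lo, mid);            // task overhead would dominate
      msort(a, tmp, mid, hi);
   }
   merge(a, tmp, lo, mid, hi);
}

void merge_sort(float* a, int n) {
   float* tmp = malloc(n * sizeof(float));
   #pragma omp parallel
   #pragma omp single                    // one thread starts the recursion;
   msort(a, tmp, 0, n);                  // the tasks spread over the team
   free(tmp);
}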

Hristo Iliev
  • +1, Did not understand the -1, you must have encountered a hater. – dreamcrash Feb 13 '13 at 22:34
  • Neither do I, but that's what the real world delivers - haters, trolls, etc. :) – Hristo Iliev Feb 14 '13 at 10:14
  • What would be the difference in the NUMA friendly version and the same code without the "task" pragma? – Chris Feb 27 '13 at 17:20
  • It won't create a task for each iteration and will in general execute faster. The code is for illustration only. – Hristo Iliev Feb 27 '13 at 18:42
  • @HristoIliev Thank you for the answer. In your comment about the single task producer you say "...hence start processing tasks immediately as they become available (because of the nowait clause)." This is true even without the nowait clause, am I right? In this case, there is also an implicit barrier reached by the other threads, not at the end of the parallel region but at the end of the single region, is that correct? – Manuel Selva Apr 05 '16 at 08:44
  • @ManuelSelva, back then, when I wrote this particular answer, I still hadn't realised that the implicit barrier at the end of the single construct is also a task scheduling point. The case with `nowait` could allow the rest of the threads to do something else while the task producer thread is still producing tasks. – Hristo Iliev Apr 05 '16 at 10:54
  • @HristoIliev thank you for the answer and all the other ones you made on SO regarding OpenMP. Great, my understanding was correct :) – Manuel Selva Apr 05 '16 at 11:02
  • Really nice answer! I assume that (in general) the more NUMA-friendly version should be faster on most modern machines, is that correct? – zeawoas Dec 06 '19 at 14:33
  • @zeawoas developments in modern architectures tend to decrease the difference between local and remote memory, but in general NUMA-aware code works faster. – Hristo Iliev Dec 07 '19 at 11:52
  • Tutorial link gives 404. Do you have an alternative link? – Ali Tou Dec 07 '22 at 03:00
  • @AliTou The tutorial has moved to a new location. I've updated the link. – Hristo Iliev Dec 08 '22 at 11:51