Difference between kernels and parallel directives in the OpenACC standard

Question

I have been using the PGI compiler with OpenACC support to run code on GPUs for about 3 years, but I still do not understand the difference between the terms "kernels" and "parallel". The OpenACC Getting Started Guide says:

Parallel Construct

Defines the region of the program that should be compiled for parallel execution on the accelerator device.

Kernels Construct

Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.

I do not understand the difference between "parallel execution on the accelerator device" and "compiled into a sequence of kernels for execution on the accelerator device". If the accelerator device is a GPU, then all the code is compiled into CUDA kernels of some size (by size I mean the CUDA grid and blocks), and these CUDA kernels are executed on the GPU in CUDA threads, isn't that so? What is a "sequence" of kernels? Does the "parallel" directive produce one kernel, while "kernels" can produce a sequence of kernels from the same piece of code?

Also, I use only the "parallel loop" directive everywhere. For example, to parallelize a for loop for execution on the GPU, I write

#pragma acc parallel loop gang vector copy(...) present(...)
  for(int i=0; i<N; ++i)
  {
    ...
  }

Is this correct? When should "kernels" be used? Or is it a synonym for "parallel" that is now deprecated and no longer used?

Gambitier · Answer 1 · 2019-12-29T11:34:07.180

Already posted my answer here, but here it goes again.

Parallel Construct

Defines the region of the program that should be compiled for parallel execution on the accelerator device.
The parallel loop directive is an assertion by the programmer that it is both safe and desirable to parallelize the affected loop. This relies on the programmer having correctly identified the parallelism in the code and having removed anything that may be unsafe to parallelize. If the programmer asserts incorrectly that the loop may be parallelized, the resulting application may produce incorrect results.
The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator, so it does not rely heavily on the compiler’s ability to automatically parallelize the code.
When parallel loop is used on two subsequent loops that access the same data, a compiler may or may not copy the data back and forth between the host and the device between the two loops.
More experienced parallel programmers, who may have already identified parallel loops within their code, will likely find the parallel loop approach more desirable.

For example:

#pragma acc parallel
{
    #pragma acc loop
    for (i=0; i<n; i++) 
         a[i] = 3.0f*(float)(i+1);
    #pragma acc loop
    for (i=0; i<n; i++) 
         b[i] = 2.0f*a[i];
}

This generates one kernel.

There is no barrier between the two loops: the second loop may start before the first loop ends. (This is different from OpenMP.)
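
If the second loop must see the finished results of the first, as it does here because b reads a, one common way to get that ordering is to give each loop its own parallel region. A minimal sketch, assuming a hypothetical helper name fill_then_scale and an enclosing data region (neither is from the answer above) so that the arrays stay on the device between the two regions:

```c
/* Hypothetical helper, not from the original answer: fills a, then derives b from a.
   Without an OpenACC compiler the pragmas are ignored and the loops run serially. */
void fill_then_scale(float *a, float *b, int n)
{
    /* Keep both arrays on the device for the whole sequence and copy them
       back to the host once, at the end of the data region. */
    #pragma acc data copyout(a[0:n], b[0:n])
    {
        /* First compute region: it completes before the next one starts. */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = 3.0f * (float)(i + 1);

        /* Second compute region: a[] is fully written by this point. */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }
}
```

A compute region without an async clause completes before the host moves on, so the second loop only starts once a is fully written. The data region is optional for correctness; it is there to avoid copying a and b between host and device between the two regions.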

Kernels Construct

Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.
An important thing to note about the kernels construct is that the compiler will analyze the code and only parallelize when it is certain that it is safe to do so. In some cases, the compiler may not have enough information at compile time to determine whether a loop is safe to parallelize, in which case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely parallel.
The kernels construct gives the compiler maximum leeway to parallelize and optimize the code how it sees fit for the target accelerator but also relies most heavily on the compiler’s ability to automatically parallelize the code.
One more notable benefit that the kernels construct provides is that if multiple loops access the same data it will only be copied to the accelerator once which may result in less data motion.
Programmers with less parallel programming experience or whose code contains a large number of loops that need to be analyzed may find the kernels approach much simpler, as it puts more of the burden on the compiler.

For example:

#pragma acc kernels
{
   for (i=0; i<n; i++)
       a[i] = 3.0f*(float)(i+1);
   for (i=0; i<n; i++)
        b[i] = 2.0f*a[i];
}

This generates two kernels.

There is an implicit barrier between the two loops: the second loop will start after the first loop ends.
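
When the compiler cannot prove on its own that a loop inside a kernels region is safe, the programmer can supply the missing information. A minimal sketch, assuming a hypothetical function scale_into whose caller guarantees that x and y never overlap; the restrict qualifier and the loop independent clause are the two hints used:

```c
/* Hypothetical helper, not from the original answer.
   Assumption: the caller guarantees that x and y never overlap. */
void scale_into(int n, const float *restrict x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        /* "independent" is the programmer's promise that the iterations do
           not depend on each other, so the compiler need not prove it. */
        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            y[i] = 2.0f * x[i];
    }
}
```

With loop independent, the responsibility shifts back to the programmer, just as with parallel loop: if the promise is wrong, the results may be wrong.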

score 0 · Accepted Answer · answered Dec 17 '19 at 15:45

The best way to think of the difference is that with "parallel", you the programmer are defining which loops to parallelize and how. Basically you're telling the compiler to parallelize particular loops. With "kernels", you're defining a region of code which may be parallelized but it's the compiler's job to then determine which loops to parallelize and how.

For "parallel", all code within the region is offloaded as one CUDA kernel. If you have multiple outer loops within the "parallel" region, they will still be offloaded in one CUDA kernel. With "kernels", the compiler discovers the parallelism itself, so multiple loops within the region may be split into a sequence of separate CUDA kernel launches.

Full details can be found at: https://www.pgroup.com/lit/articles/insider/v4n2a1.htm

Note that access to the article does require you to have a PGI Web User account.

The example above was in C, but it's worth noting that `kernels` is also the way you would get array syntax to run on the accelerator (since there's no loop to attach a `parallel loop` to). I've also seen cases where the compiler is more creative with its parallelization scheme when using `kernels`, but that's not a universal thing. Mat nailed the fundamental difference, which is who takes responsibility for ensuring the code can safely be parallelized. - jefflarkin, Dec 18 '19 at 15:02
