There are two types of functions that can be called on the device:

`__device__` functions are like ordinary C or C++ functions: they operate in the context of a single (CUDA) thread. They can be called from any number of threads in a block, but from the standpoint of the function itself, calling one does not automatically create a set of threads the way a kernel launch does.
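A minimal sketch of this (the function and kernel names here are illustrative, not taken from any CUDA sample): each thread that reaches the call site executes the `__device__` function on its own, exactly like an ordinary function call.

```cuda
// A __device__ function: runs in the context of the single thread
// that calls it, just like an ordinary C++ function.
__device__ int square(int x) {
    return x * x;
}

// Each thread of the kernel calls square() independently;
// the call itself creates no new threads.
__global__ void square_all(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = square(in[i]);
    }
}
```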
`__global__` functions, or "kernels", can only be called using a kernel launch (e.g. `my_kernel<<<...>>>(...);` in the CUDA runtime API). When calling a `__global__` function via a kernel launch, you specify the number of blocks and threads to launch as part of the kernel configuration (`<<<...>>>`). If your GPU is of compute capability 3.5 or higher, you can also call a `__global__` function from device code (using essentially the same kernel launch syntax, which lets you specify blocks and threads for the "child" kernel). This employs CUDA Dynamic Parallelism, which has a whole section of the programming guide dedicated to it.
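A sketch of both launch forms, with illustrative names. The device-side ("child") launch assumes a compute capability 3.5+ GPU and compilation with relocatable device code enabled (e.g. `nvcc -rdc=true`), as Dynamic Parallelism requires:

```cuda
#include <cstdio>

__global__ void child_kernel(int parent_thread) {
    printf("child thread %d launched by parent thread %d\n",
           threadIdx.x, parent_thread);
}

// A __global__ function launching another __global__ function
// from device code (CUDA Dynamic Parallelism).
__global__ void parent_kernel() {
    // Same <<<blocks, threads>>> configuration syntax as on the host.
    child_kernel<<<1, 4>>>(threadIdx.x);
}

int main() {
    // Host-side kernel launch: 2 blocks of 8 threads each.
    parent_kernel<<<2, 8>>>();
    cudaDeviceSynchronize();
    return 0;
}
```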
There are many CUDA sample codes that demonstrate these, e.g.:

- calling a `__device__` function, such as simpleTemplates
- calling a `__global__` function from the device, such as cdpSimplePrint