There are two types of functions that can be called on the device:

`__device__` functions are like ordinary C or C++ functions: they operate in the context of a single (CUDA) thread. They can be called from any number of threads in a block, but from the standpoint of the function itself, calling one does not automatically create a set of threads the way a kernel launch does.
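A minimal sketch of this (the function and kernel names here are illustrative, not taken from any CUDA sample): each thread that reaches the call site executes the `__device__` function on its own, exactly like an ordinary function call.

```cuda
// A __device__ function: runs in the context of the single thread
// that calls it, just like an ordinary C++ function.
__device__ int square(int x) {
    return x * x;
}

// Each thread of the kernel calls square() independently;
// the call itself creates no new threads.
__global__ void square_all(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = square(in[i]);
    }
}
```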
`__global__` functions, or "kernels", can only be called using a kernel launch (e.g. `my_kernel<<<...>>>(...);` in the CUDA runtime API). When calling a `__global__` function via a kernel launch, you specify the number of blocks and threads to launch as part of the kernel configuration (`<<<...>>>`). If your GPU is of compute capability 3.5 or higher, you can also call a `__global__` function from device code (using essentially the same kernel launch syntax, which lets you specify blocks and threads for the "child" kernel). This employs CUDA Dynamic Parallelism, which has a whole section of the programming guide dedicated to it.
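A sketch of both launch forms, with illustrative names. The device-side ("child") launch assumes a compute capability 3.5+ GPU and compilation with relocatable device code enabled (e.g. `nvcc -rdc=true`), as Dynamic Parallelism requires:

```cuda
#include <cstdio>

__global__ void child_kernel(int parent_thread) {
    printf("child thread %d launched by parent thread %d\n",
           threadIdx.x, parent_thread);
}

// A __global__ function launching another __global__ function
// from device code (CUDA Dynamic Parallelism).
__global__ void parent_kernel() {
    // Same <<<blocks, threads>>> configuration syntax as on the host.
    child_kernel<<<1, 4>>>(threadIdx.x);
}

int main() {
    // Host-side kernel launch: 2 blocks of 8 threads each.
    parent_kernel<<<2, 8>>>();
    cudaDeviceSynchronize();
    return 0;
}
```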
There are many CUDA sample codes that demonstrate these, e.g.:

- calling a `__device__` function, such as simpleTemplates
- calling a `__global__` function from the device, such as cdpSimplePrint