Should I check the number of threads in kernel code?

Question

I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:

__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x*blockIdx.y*gridDim.x  //rows preceding current row in grid
            + blockDim.x*blockIdx.x             //blocks preceding current block
            + threadIdx.x;

    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}

They think there are situations where CUDA might launch more threads than specified, for the sake of alignment or warp size, so the check is needed every time. However, I have not yet seen an example kernel on the internet that actually performs this check.

Can CUDA actually launch more threads than specified block/grid dimensions?

Robert Crovella · Accepted Answer · 2014-10-06T13:55:52.270

CUDA will not launch more threads than are specified by the block/grid dimensions.

However, block dimensions are coarse-grained (it is desirable for the block size to be a multiple of 32, and a block is limited to 1024 threads, or 512 on older hardware), so it is often difficult to make the total number of threads in the grid exactly equal to the desired problem size.

In these cases, the typical approach is to launch more threads than needed, effectively rounding up to the next multiple of the block size, and to use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
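The rounding-up step is usually done on the host with an integer ceiling division. Below is a minimal sketch in plain host-side C++ (it needs no GPU). The helper names `blocksNeeded` and `extraThreads` are hypothetical and are not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical helper (not a CUDA API call): smallest number of blocks of
// `blockSize` threads that covers `problemSize` elements.
int blocksNeeded(int problemSize, int blockSize)
{
    assert(problemSize >= 0 && blockSize > 0);
    // Integer ceiling division: rounds up whenever there is a remainder.
    return (problemSize + blockSize - 1) / blockSize;
}

// Hypothetical helper: number of threads launched beyond the problem size.
// These are the threads the in-kernel check has to keep idle.
int extraThreads(int problemSize, int blockSize)
{
    return blocksNeeded(problemSize, blockSize) * blockSize - problemSize;
}
```

A launch would then typically look something like `myKernel<<<blocksNeeded(n, 256), 256>>>(n);`, with the kernel's own `if (threadId < n)` check keeping the surplus threads from doing any work.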

In your example, this could be clarified by writing:

__global__ void myKernel(int problem_size)


    if (threadId < problem_size)

which communicates what is intended, that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.

As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.

If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.
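To make these numbers concrete, here is a host-side emulation in plain C++ rather than a real kernel. It walks a 1D launch configuration sequentially and applies the same check to each thread index. The function name `emulateVectorAdd` and its parameters are hypothetical. A real kernel would run the threads in parallel on the device, so treat this only as a sketch of the indexing logic.

```cpp
#include <cassert>
#include <vector>

// Sequential stand-in for a guarded vector-add kernel (hypothetical name).
// Visits every (block, thread) pair of a 1D launch and returns how many
// threads were idled by the bounds check.
int emulateVectorAdd(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c, int numBlocks, int blockSize)
{
    assert(a.size() == b.size() && b.size() == c.size());
    const int problemSize = static_cast<int>(c.size());
    int idleThreads = 0;
    for (int block = 0; block < numBlocks; ++block) {
        for (int thread = 0; thread < blockSize; ++thread) {
            // Same global index a 1D kernel would compute from
            // blockIdx.x, blockDim.x and threadIdx.x.
            const int threadId = block * blockSize + thread;
            if (threadId < problemSize) {
                c[threadId] = a[threadId] + b[threadId];  // in range: do the work
            } else {
                ++idleThreads;  // beyond the problem size: must not touch memory
            }
        }
    }
    return idleThreads;
}
```

Called with three vectors of 10000 elements, 40 blocks and a block size of 256, the function visits 10240 thread indices and returns 240 idle threads. Without the check, those 240 indices would read and write past the end of the vectors.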
