Max number of threads which can be initiated in a single CUDA kernel

Question

I am confused about the maximum number of threads which can be launched in a Fermi GPU.

My GTX 570 device query says the following.

  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535

As I understand it, the above means:

For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.

Is this correct? What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?

After writing and launching the CUDA kernel, how do I know that the threads and blocks I launched have actually been instantiated? I don't want the GPU to compute junk or behave weirdly if I have accidentally requested more threads than are possible for that particular kernel.

This may help you: http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s — user1154664, Aug 22 '12 at 17:44

score 25 · Accepted Answer · edited Nov 21 '14 at 05:43

For a CUDA kernel we can launch at most 65536 blocks. Each launched block can contain up to 1024 threads. Hence in principle, I can launch up to 65536*1024 (=67108864) threads.

No, this is not correct. You can launch a grid of up to 65535 x 65535 x 65535 blocks, and each block can contain at most 1024 threads in total across its three dimensions, although per-thread resource limits (register usage, for example) might restrict the number of threads per block to less than this maximum.
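To make the limits above concrete, here is a minimal sketch that checks a launch configuration against the values deviceQuery prints. `fitsDeviceLimits` is a hypothetical helper name introduced for illustration; the fields it reads (`maxThreadsPerBlock`, `maxThreadsDim`, `maxGridSize`) are members of the runtime's `cudaDeviceProp` struct. It only covers the static device limits, not per-kernel resource usage.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): returns true if the
// grid and block dimensions respect the static limits in cudaDeviceProp.
// It does NOT account for register or shared memory usage of a kernel.
bool fitsDeviceLimits(dim3 grid, dim3 block, const cudaDeviceProp& prop)
{
    const unsigned int b[3] = {block.x, block.y, block.z};
    const unsigned int g[3] = {grid.x, grid.y, grid.z};

    unsigned long long threadsPerBlock = 1;
    for (int i = 0; i < 3; ++i) {
        // Each block dimension has its own limit (e.g. 1024 x 1024 x 64).
        if (b[i] == 0 || b[i] > (unsigned int)prop.maxThreadsDim[i]) return false;
        // Each grid dimension has its own limit (e.g. 65535 per dimension).
        if (g[i] == 0 || g[i] > (unsigned int)prop.maxGridSize[i]) return false;
        threadsPerBlock *= b[i];
    }

    // The product of the block dimensions is capped separately (e.g. 1024).
    return threadsPerBlock <= (unsigned long long)prop.maxThreadsPerBlock;
}
```

In practice the `prop` argument would be filled by `cudaGetDeviceProperties(&prop, 0)` for device 0. With the GTX 570 numbers from the question, a block of 32 x 32 x 2 is rejected because it holds 2048 threads, even though each dimension is individually within range.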

What if my thread uses a lot of registers? Will we still be able to reach this theoretical maximum number of threads?

No, you will not be able to reach the maximum threads per block in this case. Each release of the NVIDIA CUDA toolkit includes an occupancy calculator spreadsheet you can use to see the effect of register pressure on the limiting block size.
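As a programmatic complement to the spreadsheet, the runtime can report the limits for a specific compiled kernel. The sketch below is an illustration, not the only way to do this: `scaleKernel` and `queryKernelLimits` are hypothetical names, while `cudaFuncGetAttributes` and `cudaOccupancyMaxPotentialBlockSize` are runtime API calls (the latter assumes CUDA 6.5 or newer).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only so there is something to query.
__global__ void scaleKernel(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: prints the resource-dependent limits of scaleKernel.
cudaError_t queryKernelLimits()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return err;
    }
    // numRegs: registers used per thread by the compiled kernel.
    // maxThreadsPerBlock: largest block this kernel can launch with,
    // which can be lower than the device-wide maximum.
    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);

    // Ask the runtime for a block size that maximizes occupancy
    // (assumes no dynamic shared memory and no block size cap).
    int minGridSize = 0, blockSize = 0;
    err = cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                             scaleKernel, 0, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query: %s\n", cudaGetErrorString(err));
        return err;
    }
    printf("suggested block size: %d (min grid size %d)\n", blockSize, minGridSize);
    return cudaSuccess;
}
```

Calling `queryKernelLimits()` from `main` on a machine with a CUDA device prints the three values; a register-heavy kernel would typically report a `maxThreadsPerBlock` below the device maximum.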

Also, after writing and launching the CUDA kernel, how do I know that the threads and blocks I launched have actually been instantiated? I don't want the GPU to compute junk or behave weirdly if I have accidentally requested more threads than are possible for that particular kernel.

If you choose an illegal execution configuration (an invalid block size or grid size), the kernel will not launch and the runtime will report a cudaErrorInvalidConfiguration error. You can use the standard cudaPeekAtLastError() and cudaGetLastError() calls to check the status of any kernel launch.
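A minimal sketch of that check, assuming the CUDA runtime API. `emptyKernel` and `launchAndCheck` are hypothetical names used for illustration; `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString` are the runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel that does nothing; only the launch status matters here.
__global__ void emptyKernel() {}

// Hypothetical helper: launches emptyKernel and returns the first error seen.
cudaError_t launchAndCheck(dim3 grid, dim3 block)
{
    emptyKernel<<<grid, block>>>();

    // Launch errors such as an invalid configuration are reported here;
    // in that case the kernel never ran.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors that occur while the kernel executes are reported by the
    // next synchronizing call.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

For example, `launchAndCheck(dim3(1), dim3(2048))` requests more threads per block than the 1024 allowed on the device in the question, so the call returns an error instead of running anything.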

Is there a related function like `cudaOccupancyMaxPotentialBlockSize` which can return us the maximum number of grids and blocks that we can launch on any device? Something that abstracts the numbers away would be useful. — Mahyar Mirrashed, Jun 29 '22 at 19:10
