I am confused about the maximum number of threads that can be launched on a Fermi GPU.
My GTX 570 device query reports the following:
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
I understand this output as follows:
A CUDA kernel can be launched with at most 65535 blocks, and each block can contain up to 1024 threads. Hence, in principle, I can launch up to 65535 * 1024 (= 67107840) threads.
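To make the arithmetic concrete, here is a minimal sketch of how I would read these limits at runtime instead of hard-coding them. It assumes the GPU is device 0, and `threadsIn1DLaunch` is a hypothetical helper I made up for illustration. It only multiplies the block count along the x dimension of the grid by the per-block thread limit, which is the figure I am asking about.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: total number of threads in a 1D launch.
static unsigned long long threadsIn1DLaunch(unsigned long long blocks,
                                            unsigned long long threadsPerBlock) {
    return blocks * threadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    // Assumption: the GPU of interest is device 0.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Product of the x-dimension grid limit and the per-block thread limit.
    printf("Threads in the largest 1D launch: %llu\n",
           threadsIn1DLaunch(prop.maxGridSize[0], prop.maxThreadsPerBlock));
    return 0;
}
```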
Is this correct? What if each thread uses a lot of registers? Can I still reach this theoretical maximum number of threads?
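For the register part of the question, this is a minimal sketch of how I would inspect a kernel's register usage. `myKernel` is a placeholder kernel and `mayFitRegisterBudget` is a hypothetical helper. The helper is only a rough check: it ignores the granularity with which the hardware allocates registers, so passing it does not guarantee that a launch succeeds. The `maxThreadsPerBlock` field reported by `cudaFuncGetAttributes` is the per-kernel block size limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute the real one.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: rough check that a block's registers fit the budget.
// It ignores allocation granularity, so "true" is necessary, not sufficient.
static bool mayFitRegisterBudget(int regsPerThread, int threadsPerBlock,
                                 int regsPerBlock) {
    return (long long)regsPerThread * threadsPerBlock <= regsPerBlock;
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Registers per thread:              %d\n", attr.numRegs);
    printf("Registers available per block:     %d\n", prop.regsPerBlock);
    printf("Max threads per block (this kernel): %d\n", attr.maxThreadsPerBlock);
    printf("Device-limit block may fit:        %s\n",
           mayFitRegisterBudget(attr.numRegs, prop.maxThreadsPerBlock,
                                prop.regsPerBlock) ? "yes" : "no");
    return 0;
}
```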
After writing and launching a CUDA kernel, how do I know that the threads and blocks I requested have actually been instantiated? I don't want the GPU to compute junk or behave unpredictably if I accidentally request more threads than are possible for that particular kernel.
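The kind of check I have in mind looks roughly like the sketch below. `myKernel` and `blocksFor` are placeholders I made up, and the problem size and block size are arbitrary. `cudaGetLastError()` reports errors from the launch itself, such as an invalid configuration, and `cudaDeviceSynchronize()` returns errors that occur while the kernel runs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: each thread writes its global index.
__global__ void myKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;  // guard against threads past the end of the array
}

// Hypothetical helper: number of blocks needed to cover n elements.
static int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main() {
    const int n = 1 << 20;            // arbitrary problem size
    const int threadsPerBlock = 256;  // arbitrary block size

    int *d_out = NULL;
    cudaError_t err = cudaMalloc(&d_out, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    myKernel<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d_out, n);

    // Errors detected at launch time (for example a bad grid or block size).
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    // Errors that occur while the kernel is executing.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    printf("Kernel completed without reported errors\n");
    cudaFree(d_out);
    return 0;
}
```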