Threads per Warp x Max Warps per Multiprocessor = Max Threads per Multiprocessor
32 x 48 = 1536
Max Warps per Multiprocessor actually means the maximum number of **resident** warps per multiprocessor, and Max Threads per Multiprocessor is the maximum number of **resident** threads per multiprocessor.
Check this out: Table 14 shows that the rule above holds for every compute capability.
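As a quick way to see the rule at work, here is a minimal sketch that applies it to a few compute capabilities. The warp counts in the dictionary are an assumption: they are values recalled from the programming guide's table rather than taken from the question, so check them against the version of the guide you are reading. The names `MAX_RESIDENT_WARPS` and `max_resident_threads` are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated above

# Max resident warps per multiprocessor, keyed by compute capability.
# Assumption: recalled from the programming guide's table; only the
# 48-warp entry comes from the question itself.
MAX_RESIDENT_WARPS = {
    "1.0": 24,
    "1.2": 32,
    "2.x": 48,
    "3.0": 64,
}


def max_resident_threads(max_warps, warp_size=WARP_SIZE):
    """Max resident threads per multiprocessor = warp size * max resident warps."""
    return warp_size * max_warps


for cc, warps in MAX_RESIDENT_WARPS.items():
    threads = max_resident_threads(warps)
    print(f"compute capability {cc}: {WARP_SIZE} x {warps} = {threads}")
```

For the 48-warp entry this reproduces the 32 x 48 = 1536 figure from the question.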
The number 1536 means that each multiprocessor (called an SM, for Streaming Multiprocessor, in CUDA) can hold at most 1536 active (resident) threads. It doesn't mean that you can only launch 1536 threads: a single kernel launch can contain far more than that, but each SM can hold only 1536 of them at a time. Nor does it mean that 1536 threads are physically executing at the same instant. The warp is the unit of execution, and a warp has consisted of 32 threads in every generation of CUDA to date.
The following quote is from here.
By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
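To make the difference between launched and resident threads concrete, here is a rough sketch of the arithmetic. It assumes the 48-warp limit from the question and considers only that limit: real residency is also bounded by the maximum number of blocks per SM, register usage, and shared memory, all of which this sketch ignores. The launch size of one million threads, the block size of 256, and the function names are arbitrary choices for illustration.

```python
import math

WARP_SIZE = 32
MAX_RESIDENT_WARPS_PER_SM = 48  # from the question: 1536 / 32


def resident_blocks_per_sm(threads_per_block):
    """Upper bound on resident blocks per SM from the warp limit alone.

    A block occupies whole warps, so its size is rounded up to a
    multiple of the warp size before dividing.
    """
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return MAX_RESIDENT_WARPS_PER_SM // warps_per_block


def resident_threads_on_device(threads_per_block, num_sms):
    """Threads that can be resident at once across all SMs (warp limit only)."""
    return resident_blocks_per_sm(threads_per_block) * threads_per_block * num_sms


total_threads = 1_000_000   # hypothetical launch size
threads_per_block = 256     # hypothetical block size
num_sms = 16                # the 16-multiprocessor GPU from the quote

blocks_launched = math.ceil(total_threads / threads_per_block)
print("blocks launched:", blocks_launched)                       # 3907
print("resident blocks per SM:", resident_blocks_per_sm(256))    # 6
print("resident threads on device:",
      resident_threads_on_device(threads_per_block, num_sms))    # 24576
```

With 256-thread blocks, each SM holds 6 blocks (6 x 256 = 1536 threads), so 16 SMs hold 24,576 threads at once, which matches the "more than 24,000" in the quote. The remaining blocks of the launch wait until resident blocks finish. With 1024-thread blocks only one block fits per SM under this limit, leaving a third of the warp slots unused.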
EDIT
The additional question is:
Could you also highlight why the Max Warps per Multiprocessor is 48 and not a power of 2 (since the number of cores and register size = 65536 bytes are all powers of two)?
The number of cores per SM is not always a power of two. There are also some subtle differences between a CPU core and a CUDA core. Take devices with compute capability 3.x, for example (link).
A multiprocessor consists of:
- 192 CUDA cores for arithmetic operations,
- 32 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.
As you can see, the number of CUDA cores (192) is not a power of 2. And whereas a CPU core is general-purpose, a CUDA core doesn't perform single-precision floating-point transcendental functions; those operations are handled by the separate special function units. Check this out.
Also, your question says Registers per Multiprocessor is 32K. That means there are 32K 32-bit registers per SM, so the total register file size is 32K x 4 bytes = 128 KB.
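The register arithmetic can be checked with a few lines. This is only a sketch: the per-thread figure is a plain integer division and assumes all 1536 threads are resident, whereas real hardware allocates registers in coarser units, so the practical budget may be somewhat different. The function names are hypothetical.

```python
REGISTERS_PER_SM = 32 * 1024   # "32K" registers per multiprocessor
BYTES_PER_REGISTER = 4         # each register is 32 bits wide
MAX_RESIDENT_THREADS = 1536    # 32 threads x 48 warps


def register_file_bytes(num_registers=REGISTERS_PER_SM,
                        bytes_per_register=BYTES_PER_REGISTER):
    """Total register file size per SM in bytes."""
    return num_registers * bytes_per_register


def registers_per_thread(num_registers=REGISTERS_PER_SM,
                         resident_threads=MAX_RESIDENT_THREADS):
    """Registers available to each thread if every thread slot is occupied."""
    return num_registers // resident_threads


print(register_file_bytes() // 1024, "KB")       # 128 KB
print(registers_per_thread(), "registers/thread")  # 21
```

So a kernel that needs more than about 21 registers per thread cannot keep all 1536 thread slots of such an SM filled.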
Given all that, I don't think there's any reason for Max Warps per Multiprocessor to be a power of 2.
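For completeness, a small sketch that checks which of the numbers mentioned in this answer are powers of two. The helper `is_power_of_two` is written here for illustration; it uses the usual bit trick that a power of two has exactly one bit set.

```python
def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


# Figures taken from the answer above.
figures = {
    "threads per warp": 32,
    "max resident warps per SM": 48,
    "CUDA cores per SM (compute capability 3.x)": 192,
    "registers per SM": 32 * 1024,
}

for name, value in figures.items():
    print(f"{name}: {value} -> power of two: {is_power_of_two(value)}")
```

Both 48 (3 x 16) and 192 (3 x 64) fail the check, so hardware limits on an SM are not consistently powers of two.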