
On my GPU, with Compute Capability 2.0, the maximum number of threads per multiprocessor is 1536. Why is it not a power of 2?

Here are some details for my GPU:

Physical Limits for GPU Compute Capability: 2.0   
Threads per Warp                            32  
Max Warps per Multiprocessor                48  
Max Thread Blocks per Multiprocessor        8  
Max Threads per Multiprocessor              1536  
Maximum Thread Block Size                   1024  
Registers per Multiprocessor                32768  
Max Registers per Thread Block              32768  
Max Registers per Thread                    63  
Shared Memory per Multiprocessor (bytes)    16384  
Max Shared Memory per Block                 16384  
Register allocation unit size               64  
Register allocation granularity             warp  
Shared Memory allocation unit size          128  
Warp allocation granularity                 2  
apaderno

1 Answer


Threads per Warp x Max Warps per Multiprocessor = Max Threads per Multiprocessor

32 x 48 = 1536
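
As a quick check of the arithmetic above, here is a minimal Python sketch. The constants are copied from the question's table; the helper names `max_resident_threads` and `is_power_of_two` are made up for illustration and are not part of any CUDA API.

```python
# Limits copied from the question's table (compute capability 2.0).
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 48


def max_resident_threads(threads_per_warp: int, max_warps: int) -> int:
    """Resident-thread limit implied by the resident-warp limit."""
    return threads_per_warp * max_warps


def is_power_of_two(n: int) -> bool:
    """True when n is positive and has exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


limit = max_resident_threads(THREADS_PER_WARP, MAX_WARPS_PER_SM)
print(limit, is_power_of_two(limit))  # 1536 False
```

The thread limit is a multiple of the warp size, so it is a power of 2 only when the warp limit is one, and 48 is not.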

Max Warps per Multiprocessor actually means the maximum number of **resident** warps per multiprocessor, and Max Threads per Multiprocessor means the maximum number of **resident** threads per multiprocessor.

In Table 14 of the linked reference, you will see that the above rule applies to every compute capability.

The number 1536 means that each multiprocessor (called an SM, for Streaming Multiprocessor, in CUDA) can hold at most 1536 resident threads. It doesn't mean that you can only launch 1536 threads. A single kernel launch can contain far more than 1536 threads, but each SM can hold only 1536 of them at a time. Nor does it mean that 1536 threads are physically executing at the same instant. The unit of execution is the warp, which consists of 32 threads in every generation of CUDA to date.
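
To make the "resident" limit concrete, here is a rough Python sketch that estimates how many blocks of a given size fit on one SM. It uses only the warp and block limits from the question's table and deliberately ignores register usage, shared memory usage, and the warp allocation granularity, so treat it as a simplification rather than a real occupancy calculator. The function names are hypothetical.

```python
# Limits copied from the question's table (compute capability 2.0).
THREADS_PER_WARP = 32
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8


def resident_blocks_per_sm(block_size: int) -> int:
    """Blocks that fit on one SM, considering only warp and block limits."""
    # A partially filled warp still occupies a whole warp slot.
    warps_per_block = (block_size + THREADS_PER_WARP - 1) // THREADS_PER_WARP
    return min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)


def resident_threads_per_sm(block_size: int) -> int:
    """Threads resident on one SM for the given block size."""
    return resident_blocks_per_sm(block_size) * block_size


for size in (1024, 512, 256, 128):
    print(size, resident_blocks_per_sm(size), resident_threads_per_sm(size))
# 1024 1 1024
# 512 3 1536
# 256 6 1536
# 128 8 1024
```

Under these assumptions, a block size of 512 or 256 reaches the 1536-thread limit, while 1024 leaves a third of the warp slots unused because a second block does not fit.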

The following quote is from here.

By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (termed a warp of threads). Modern NVIDIA GPUs can support up to 1536 active threads concurrently per multiprocessor (see Features and Specifications of the CUDA C Programming Guide) On GPUs with 16 multiprocessors, this leads to more than 24,000 concurrently active threads.
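
The "more than 24,000" figure in the quote is the per-SM limit multiplied by the number of SMs. A tiny Python sketch of that multiplication (the function name is hypothetical):

```python
MAX_RESIDENT_THREADS_PER_SM = 1536  # per-SM limit from the quote


def device_resident_threads(num_sms: int) -> int:
    """Upper bound on resident threads across the whole device."""
    return num_sms * MAX_RESIDENT_THREADS_PER_SM


print(device_resident_threads(16))  # 24576
```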


EDIT

The additional question is:

Could you also highlight why the Max Warps per Multiprocessor is 48 and not a power of 2 (since the number of cores and register size = 65536 bytes are all powers of two)?

The number of cores per SM is not always a power of two. There are also some subtle differences between a CPU core and a CUDA core. Take devices with compute capability 3.x, for example (link).

A multiprocessor consists of:

  • 192 CUDA cores for arithmetic operations,
  • 32 special function units for single-precision floating-point transcendental functions,
  • 4 warp schedulers.

As you can see, the number of CUDA cores (192) is not a power of 2. And whereas a CPU core is general-purpose, a CUDA core doesn't compute single-precision floating-point transcendental functions; those operations are handled by the separate special function units. Check this out.

Also, your question lists Registers per Multiprocessor as 32K. That means there are 32K 32-bit registers per SM, so the total register file size is 128 KB, not 65536 bytes.
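
The register arithmetic can be checked with a short Python sketch. The register count comes from the question's table. The per-thread average is only a back-of-the-envelope figure that ignores the register allocation unit size and warp granularity listed there.

```python
REGISTERS_PER_SM = 32768    # from the question's table
BYTES_PER_REGISTER = 4      # each register is 32 bits wide
MAX_RESIDENT_THREADS = 1536

register_file_bytes = REGISTERS_PER_SM * BYTES_PER_REGISTER
register_file_kib = register_file_bytes // 1024
print(register_file_bytes, register_file_kib)  # 131072 128

# Whole registers available per thread if all 1536 threads are resident
# (ignores allocation granularity, so the real figure may be lower).
avg_registers_per_thread = REGISTERS_PER_SM // MAX_RESIDENT_THREADS
print(avg_registers_per_thread)  # 21
```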

Given all that, I don't think there's a reason for the Max Warps per Multiprocessor to be a power of 2.

nglee
  • Thank you for the explanation. Could you also highlight why the Max Warps per Multiprocessor is 48 and not a power of 2 (since the number of cores and register size = 65536 bytes are all powers of two). – user2118922 May 10 '17 at 22:50
  • @user2118922 Answer edited to reflect this additional question. – nglee May 11 '17 at 07:07
  • Thank you. I had some lecture slides where these maximum limits and the diagram did not match, which caused a lot of confusion. I am using compute capability 2.0 (not 2.1), and the [link](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#architecture-2-x) suggests that it has 32 CUDA cores for arithmetic operations, but the _Maximum number of resident warps per multiprocessor_ is still 48 (not a power of 2), according to this [link](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities). I now understand that this limit applies to 2.x (not just 2.0). – user2118922 May 13 '17 at 05:07