Your biggest question has already been answered in About warp and threads and How are CUDA threads divided into warps?, so I have focused this answer on the why.
The hardware always allocates threads to a block in multiples of the warp size. The warp size is implementation defined, and the number 32 is mainly related to shared memory organization, data access patterns, and data flow control [1].
So, a block size that is a multiple of 32 does not improve performance by itself, but it ensures that no allocated threads are wasted. Note that whether every thread does something useful still depends on what you do with the threads within the block.
A block size that is not a multiple of 32 is rounded up to the nearest multiple, even if you request fewer threads. The GPU Optimization Fundamentals presentation by Cliff Woolley from the NVIDIA Developer Technology Group has interesting hints about performance.
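As a quick check, the rounding is easy to compute from the device's reported warp size. The snippet below is a minimal sketch (the block size of 48 is just a hypothetical example) showing how many warps, and therefore hardware threads, such a block actually occupies:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("warpSize = %d\n", prop.warpSize);  // 32 on current NVIDIA GPUs

    // Hypothetical block size of 48 threads: the scheduler still allocates
    // ceil(48 / 32) = 2 full warps, so 64 hardware threads are reserved
    // and 16 lanes in the second warp sit idle.
    int blockSize = 48;
    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    printf("A block of %d threads occupies %d warps (%d hardware threads)\n",
           blockSize, warpsPerBlock, warpsPerBlock * prop.warpSize);
    return 0;
}
```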
In addition, memory operations and instructions are executed per warp, so you can understand the importance of this number. I think the reason why it is 32 and not 16 or 64 is undocumented, so I like to remember the warp size as "The Answer to the Ultimate Question of Life, the Universe, and Everything" [2].
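To illustrate the per-warp execution mentioned above, here is a minimal sketch (the kernel names are mine) contrasting a branch that diverges inside every warp with one that keeps each warp on a single path; in the divergent case the hardware serializes both branches:

```cuda
// Branching on the lane index splits every warp: both paths execute serially.
__global__ void divergent(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 32 < 16)   // half of each warp takes each branch
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}

// Branching on the warp index keeps all 32 lanes of a warp together,
// so no serialization occurs.
__global__ void uniform(float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // whole warps choose one branch
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}
```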
[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Elsevier, 2010.
[2] Douglas Adams. The Hitchhiker's Guide to the Galaxy.