NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (i.e. X x Y x Z = 90 x 1 x 1) - a GPU core will have three warps to schedule instruction execution for (see the sketch after this list):
- threads (0,0,0) through (31,0,0)
- threads (32,0,0) through (63,0,0)
- threads (64,0,0) through (89,0,0) + 6 unused lanes
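For the 1D case, the grouping is just integer division of the thread index by the warp size. A minimal sketch illustrating this (the kernel name is mine, and a warp size of 32 is assumed):

```
#include <cstdio>

__global__ void show_warp_1d()
{
    // In a 1D block, the linear thread index is just threadIdx.x;
    // warps are formed from consecutive groups of 32 such indices.
    unsigned tid = threadIdx.x;
    printf("thread (%u,0,0) -> warp %u, lane %u\n", tid, tid / 32, tid % 32);
}

int main()
{
    show_warp_1d<<<1, 90>>>();  // warps 0 and 1 are full; warp 2 holds threads 64..89
    cudaDeviceSynchronize();
}
```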
This is straightforward and obvious. But - what happens if we have a multi-dimensional block, whose X dimension is not a multiple of 32? Say, X x Y x Z = 30 x 3 x 1? There are at least two intuitive ways such a block could be broken up into warps.
Option 1 - pack threads into full warps:
- threads (0,0,0) through (29,0,0) + (0,1,0) through (1,1,0)
- threads (2,1,0) through (29,1,0) + (0,2,0) through (3,2,0)
- threads (4,2,0) through (29,2,0) + 6 unused lanes
Option 2 - keep threads with different z, y coordinates in separate warps:
- threads (0,0,0) through (29,0,0) + 2 unused lanes
- threads (0,1,0) through (29,1,0) + 2 unused lanes
- threads (0,2,0) through (29,2,0) + 2 unused lanes
The first option potentially requires fewer warps (think of the case of 16 x 2 x 1 blocks: packed, it fits in a single warp; kept row-separate, it needs two half-empty ones); the second option is likely to prevent some divergence within warps - although this depends on the specifics of the kernel code.
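To make the warp-count difference concrete, here is a small host-side sketch of the arithmetic (the function names are mine, and a warp size of 32 is assumed):

```
#include <cstdio>

// Option 1: pack the linearized thread IDs densely into warps.
unsigned warps_packed(unsigned x, unsigned y, unsigned z)
{
    return (x * y * z + 31) / 32;    // ceil(total threads / 32)
}

// Option 2: start a fresh warp for every (y,z) row of the block.
unsigned warps_per_row(unsigned x, unsigned y, unsigned z)
{
    return ((x + 31) / 32) * y * z;  // ceil(x / 32) warps per row
}

int main()
{
    printf("30 x 3 x 1: packed = %u, per-row = %u\n",
           warps_packed(30, 3, 1), warps_per_row(30, 3, 1));  // 3 vs. 3
    printf("16 x 2 x 1: packed = %u, per-row = %u\n",
           warps_packed(16, 2, 1), warps_per_row(16, 2, 1));  // 1 vs. 2
}
```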
My questions:
- If I don't try to specify anything about the aggregation into warps - which option is chosen by default? And does this differ by GPU/driver?
- Can I affect which of the two options is chosen, or otherwise affect the aggregation of threads into warps in a multidimensional block?
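Note: I suppose the default behavior could be probed empirically by printing each thread's lane within its warp. A minimal sketch of such a probe (the kernel and helper names are mine; it reads the %laneid PTX special register via inline assembly):

```
#include <cstdio>

__device__ unsigned lane_id()
{
    // %laneid is a PTX special register holding this thread's lane within its warp
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

__global__ void show_warp_membership()
{
    printf("thread (%u,%u,%u) -> lane %u\n",
           threadIdx.x, threadIdx.y, threadIdx.z, lane_id());
}

int main()
{
    show_warp_membership<<<1, dim3(30, 3, 1)>>>();
    cudaDeviceSynchronize();
}
```

Under Option 1, thread (0,1,0) should report lane 30; under Option 2, it should report lane 0.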