NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (i.e. X x Y x Z = 90 x 1 x 1) - a GPU core will have three warps to schedule instruction execution for (see the sketch after this list):
- threads (0,0,0) through (31,0,0)
- threads (32,0,0) through (63,0,0)
- threads (64,0,0) through (89,0,0) + 6 unused lanes
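For the 1D case, the grouping is just integer division of the thread index by the warp size. A minimal sketch illustrating this (the kernel name is mine, and a warp size of 32 is assumed):

```
#include <cstdio>

__global__ void show_warp_1d()
{
    // In a 1D block, the linear thread index is just threadIdx.x;
    // warps are formed from consecutive groups of 32 such indices.
    unsigned tid = threadIdx.x;
    printf("thread (%u,0,0) -> warp %u, lane %u\n", tid, tid / 32, tid % 32);
}

int main()
{
    show_warp_1d<<<1, 90>>>();  // warps 0 and 1 are full; warp 2 holds threads 64..89
    cudaDeviceSynchronize();
}
```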
This is straightforward and obvious. But - what happens if we have a multi-dimensional block, whose X dimension is not a multiple of 32? Say, X x Y x Z = 30 x 3 x 1? There are at least two intuitive ways such a block could be broken up into warps.
Option 1 - pack threads into full warps:
- threads (0,0,0) through (29,0,0) + (0,1,0) through (1,1,0)
- threads (2,1,0) through (29,1,0) + (0,2,0) through (3,2,0)
- threads (4,2,0) through (29,2,0) + 6 unused lanes
Option 2 - keep threads with different z, y coordinates in separate warps:
- threads (0,0,0) through (29,0,0) + 2 unused lanes
- threads (0,1,0) through (29,1,0) + 2 unused lanes
- threads (0,2,0) through (29,2,0) + 2 unused lanes
The first option potentially requires fewer warps (think of the case of 16 x 2 x 1 blocks: packed, it fits in a single warp; kept row-separate, it needs two half-empty ones); the second option is likely to prevent some divergence within warps - although this depends on the specifics of the kernel code.
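To make the warp-count difference concrete, here is a small host-side sketch of the arithmetic (the function names are mine, and a warp size of 32 is assumed):

```
#include <cstdio>

// Option 1: pack the linearized thread IDs densely into warps.
unsigned warps_packed(unsigned x, unsigned y, unsigned z)
{
    return (x * y * z + 31) / 32;    // ceil(total threads / 32)
}

// Option 2: start a fresh warp for every (y,z) row of the block.
unsigned warps_per_row(unsigned x, unsigned y, unsigned z)
{
    return ((x + 31) / 32) * y * z;  // ceil(x / 32) warps per row
}

int main()
{
    printf("30 x 3 x 1: packed = %u, per-row = %u\n",
           warps_packed(30, 3, 1), warps_per_row(30, 3, 1));  // 3 vs. 3
    printf("16 x 2 x 1: packed = %u, per-row = %u\n",
           warps_packed(16, 2, 1), warps_per_row(16, 2, 1));  // 1 vs. 2
}
```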
My questions:
- If I don't try to specify anything about the aggregation into warps - which option is chosen by default? And does this differ by GPU/driver?
- Can I affect which of the two options is chosen, or otherwise affect the aggregation of threads into warps in a multidimensional block?
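Note: I suppose the default behavior could be probed empirically by printing each thread's lane within its warp. A minimal sketch of such a probe (the kernel and helper names are mine; it reads the %laneid PTX special register via inline assembly):

```
#include <cstdio>

__device__ unsigned lane_id()
{
    // %laneid is a PTX special register holding this thread's lane within its warp
    unsigned lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

__global__ void show_warp_membership()
{
    printf("thread (%u,%u,%u) -> lane %u\n",
           threadIdx.x, threadIdx.y, threadIdx.z, lane_id());
}

int main()
{
    show_warp_membership<<<1, dim3(30, 3, 1)>>>();
    cudaDeviceSynchronize();
}
```

Under Option 1, thread (0,1,0) should report lane 30; under Option 2, it should report lane 0.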