In recent NVIDIA GPU microarchitectures, a single streaming multiprocessor (SM) appears to be divided into 4 sub-units; in the block diagrams, each of these has horizontal or vertical 'bars' of 8 'squares', corresponding to different functional units: integer ops, 32-bit floating-point ops, 64-bit floating-point ops, and load/store. A single warp scheduler seems to be associated with each such "quarter-SM".
Now, in the CUDA programming model, the 32 threads of a warp are locked to the same instruction stream. However, when work is actually executed and, say, only the second half or the last quarter of a warp's threads are active, can the active sub-warp(s) be scheduled onto just 2 or 3 of the quarter-SMs, with the remaining quarter-SM(s) doing some other work?
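For concreteness, here is a minimal sketch of the kind of situation I have in mind (the kernel and names are just illustrative, not from any real codebase): only the upper half of each warp takes the expensive path, and the question is whether the idle lanes translate into quarter-SM issue slots that can be used for something else.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: lanes 0..15 of each warp exit the branch early,
// so only the upper half-warp (lanes 16..31) does real work.
__global__ void partial_warp_kernel(const float* in, float* out, int n)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;   // lane index within the warp

    if (tid >= n) return;

    if (lane >= 16) {
        // Only half the warp's lanes are active in this branch;
        // the other half is masked off (idle).
        out[tid] = sqrtf(in[tid]) * 2.0f;
    }
}
```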