Understanding Warp Scheduler Utilization in CUDA: Maximum Concurrent Warps vs Resident Warps

Question

In CUDA compute capability 8.6, each Streaming Multiprocessor (SM) has four warp schedulers. Each warp scheduler can schedule up to 16 warps concurrently, meaning that theoretically up to 64 warps could be running concurrently. However, in reality, the maximum number of resident warps per SM is only 48. This presents an inconsistency: doesn't this mean that the scheduling capacity of the warp schedulers will be wasted? Despite the warp schedulers being capable of scheduling 64 warps, in practice there are only 48 warps available for them to schedule. Could anyone clarify this?

UPDATE

Why do I think 'Each warp scheduler can schedule up to 16 warps concurrently, meaning that theoretically up to 64 warps could be running concurrently'? Because in the Ampere Tuning Guide, the documentation states: "The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64)." Doesn't this imply that each warp scheduler can schedule up to 16 warps concurrently?

"Each warp scheduler can schedule up to 16 warps concurrently" Where is that written specifically for cc8.6? Not in the programming guide and not in the GA 102 whitepaper, that I can see. Anyway, I'm not sure what there is to clarify. The cc8.6 SM is limited to 1536 threads or 48 warps. The warps are statically assigned to the 4 schedulers. There wouldn't be a situation where any scheduler would have more than 12 warps. I think the obvious conclusion is that a cc8.6 warp scheduler would never be expected to have more than 12 warps, and is designed that way, and there is no "waste". — Robert Crovella, Jul 07 '23 at 16:49

einpoklum · Accepted Answer · 2023-07-12T12:38:31.260

As @RobertCrovella points out - your second sentence is incorrect. It is not the case that each warp scheduler "can schedule up to 16 warps".

Looking at the Ampere microarchitecture white paper or the relevant section of the CUDA programming guide (for CC 8.x), we don't see any mention of the number of warps a scheduler handles. We do read, though, that the SM is made up of 4 partitions, each with its own scheduler, and that warps are distributed, on reception, "among the schedulers", hence among the partitions. So it stands to reason that if an SM can have 48 resident warps, each partition (or "processing block") can have up to 12 resident warps, and that's the number each scheduler can handle.

Part of the mix-up may be that the Ampere Tuning Guide is referring to the number of resident warps on A100 GPUs (CC 8.0) rather than on other Ampere GPUs (with CC 8.6). The former can have up to 64 resident warps per SM, the latter only 48.
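
To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU required). The function names are made up for illustration, and the even split of warps across the 4 schedulers is an assumption based on the reasoning above, not something the documentation states as a per-scheduler limit. The per-SM thread limits used are 1536 for CC 8.6 and 2048 for CC 8.0.

```cpp
#include <cassert>

// Number of warps that can be resident on one SM, given the SM's
// resident-thread limit. A warp is 32 threads on current NVIDIA GPUs.
constexpr int resident_warps_per_sm(int max_threads_per_sm, int warp_size = 32) {
    return max_threads_per_sm / warp_size;
}

// Upper bound on the warps any one scheduler would see, ASSUMING resident
// warps are split evenly across the SM's schedulers (hypothetical helper).
constexpr int max_warps_per_scheduler(int max_threads_per_sm,
                                      int schedulers_per_sm = 4,
                                      int warp_size = 32) {
    return resident_warps_per_sm(max_threads_per_sm, warp_size) / schedulers_per_sm;
}

// CC 8.6 (e.g. GA102): 1536 threads per SM -> 48 warps -> 12 per scheduler.
static_assert(resident_warps_per_sm(1536) == 48, "CC 8.6 resident warps");
static_assert(max_warps_per_scheduler(1536) == 12, "CC 8.6 warps per scheduler");

// CC 8.0 (A100): 2048 threads per SM -> 64 warps -> 16 per scheduler.
static_assert(resident_warps_per_sm(2048) == 64, "CC 8.0 resident warps");
static_assert(max_warps_per_scheduler(2048) == 16, "CC 8.0 warps per scheduler");
```

On an actual device, the thread limit can be read at runtime from the `maxThreadsPerMultiProcessor` field of `cudaDeviceProp` (filled in by `cudaGetDeviceProperties`) rather than hard-coded.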

Thank you for your response, but I am still somewhat confused. In the [Ampere Tuning Guide](https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html), the documentation states: "The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64)." Doesn't this imply that each warp scheduler can schedule up to 16 warps concurrently? — Tokubara, Jul 08 '23 at 01:55
Ah, so, that's for A100 (CC 8.0). For other Ampere GPUs, it's 48. — einpoklum, Jul 08 '23 at 10:57
