
In CUDA compute capability 8.6, each Streaming Multiprocessor (SM) has four warp schedulers. Each warp scheduler can schedule up to 16 warps concurrently, meaning that theoretically up to 64 warps could be running concurrently. However, in reality, the maximum number of resident warps per SM is only 48. This presents an inconsistency: doesn't this mean that the scheduling capacity of the warp schedulers will be wasted? Despite the warp schedulers being capable of scheduling 64 warps, in practice there are only 48 warps available for them to schedule. Could anyone clarify this?

UPDATE

Why do I think 'Each warp scheduler can schedule up to 16 warps concurrently, meaning that theoretically up to 64 warps could be running concurrently'? Because in the Ampere Tuning Guide, the documentation states: "The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64)." Doesn't this imply that each warp scheduler can schedule up to 16 warps concurrently?

einpoklum
Tokubara
  • "Each warp scheduler can schedule up to 16 warps concurrently" Where is that written specifically for cc8.6? Not in the programming guide and not in the GA 102 whitepaper, that I can see. Anyway, I'm not sure what there is to clarify. The cc8.6 SM is limited to 1536 threads or 48 warps. The warps are statically assigned to the 4 schedulers. There wouldn't be a situation where any scheduler would have more than 12 warps. I think the obvious conclusion is that a cc8.6 warp scheduler would never be expected to have more than 12 warps, and is designed that way, and there is no "waste". – Robert Crovella Jul 07 '23 at 16:49

1 Answer


As @RobertCrovella points out - your second sentence is incorrect. It is not the case that each warp scheduler "can schedule up to 16 warps".

Looking at the Ampere microarchitecture white paper or the relevant section of the CUDA programming guide (for CC 8.x) - we don't see any mention of the number of warps a scheduler handles. We do read, though, that the SM is made up of 4 partitions, each of which has its own scheduler; and that warps are distributed, on reception, "among the schedulers", hence among the partitions. So, it stands to reason that if an SM can have 48 resident warps, each partition (or "processing block") can have up to 48 / 4 = 12 resident warps, and that's the number each scheduler has to handle.
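To make the arithmetic concrete, here is a minimal sketch of that reasoning. It assumes warps are split evenly across the schedulers, which is an inference from the static assignment mentioned in the comments and not something the documentation states as a per-scheduler limit. The function name `warps_per_scheduler` is made up for illustration; the 1536-thread figure for CC 8.6 is the one quoted in the comment above.

```python
WARP_SIZE = 32  # threads per warp


def warps_per_scheduler(max_threads_per_sm: int, schedulers_per_sm: int = 4) -> int:
    """Upper bound on resident warps per scheduler.

    Assumption (not from the docs): resident warps are divided evenly
    among the SM's schedulers.
    """
    max_warps_per_sm = max_threads_per_sm // WARP_SIZE
    return max_warps_per_sm // schedulers_per_sm


# CC 8.6: 1536 resident threads per SM -> 48 warps -> 12 per scheduler
print(warps_per_scheduler(1536))
```

On a real device, the per-SM thread limit can be read at runtime from `cudaDeviceProp::maxThreadsPerMultiProcessor` instead of being hard-coded.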

Part of the mixup may be that the Ampere Tuning Guide is referring to the number of resident warps on A100 GPUs (CC 8.0) rather than on other Ampere GPUs (with CC 8.6). The former can have up to 64 resident warps per SM, the latter only 48.
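A small lookup illustrates where the "16 per scheduler" figure would come from and why it does not carry over to CC 8.6. The table below is a sketch: the per-SM warp limits are the ones discussed here (64 for Volta and A100, 48 for CC 8.6), the count of 4 schedulers per SM is taken as given, and the names `SM_LIMITS` and `resident_warps_per_scheduler` are hypothetical.

```python
# (max resident warps per SM, schedulers per SM), keyed by compute capability.
# Only the architectures discussed in this thread are listed.
SM_LIMITS = {
    "7.0": (64, 4),  # Volta
    "8.0": (64, 4),  # Ampere A100
    "8.6": (48, 4),  # other Ampere GPUs
}


def resident_warps_per_scheduler(cc: str) -> int:
    """Resident warps per scheduler, assuming an even split across schedulers."""
    max_warps, schedulers = SM_LIMITS[cc]
    return max_warps // schedulers


for cc in SM_LIMITS:
    print(f"CC {cc}: up to {resident_warps_per_scheduler(cc)} warps per scheduler")
```

Under these assumptions, 16 warps per scheduler only falls out for CC 7.0 and 8.0; for CC 8.6 the same division gives 12.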

einpoklum
  • Thank you for your response, but I am still somewhat confused. In the [Ampere Tuning Guide](https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html), the documentation states: "The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64)." Doesn't this imply that each warp scheduler can schedule up to 16 warps concurrently? – Tokubara Jul 08 '23 at 01:55
  • 2
    Ah, so, that's for A100 (CC 8.0). For other Ampere GPUs, it's 48. – einpoklum Jul 08 '23 at 10:57