For CUDA compute capability 8.6, each Streaming Multiprocessor (SM) has four warp schedulers. As I understand it, each warp scheduler can manage up to 16 warps concurrently, so in theory an SM could have up to 64 warps in flight. However, the documented maximum number of resident warps per SM for compute capability 8.6 is only 48. This looks inconsistent to me: does it mean that part of the warp schedulers' capacity is wasted? The schedulers could handle 64 warps, yet at most 48 warps are ever available for them to schedule. Could anyone clarify this?
UPDATE
Why do I think each warp scheduler can schedule up to 16 warps concurrently, for a theoretical total of 64 per SM? Because the Ampere Tuning Guide states: "The maximum number of concurrent warps per SM remains the same as in Volta (i.e., 64)." With four warp schedulers per SM, doesn't 64 / 4 imply that each warp scheduler can schedule up to 16 warps concurrently?
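
To make the arithmetic behind my question concrete, here is a minimal host-only sketch. The constants are simply the figures quoted above, hard-coded rather than queried from a device, and the helper names (`assumed_capacity`, `warps_per_scheduler`) are made up for illustration. The even split across schedulers is my assumption, not something I found in the documentation.

```cpp
// Figures quoted in the question; hard-coded, not read from the hardware.
constexpr int kSchedulersPerSM = 4;            // warp schedulers per SM (CC 8.6)
constexpr int kAssumedWarpsPerScheduler = 16;  // my assumption: 64 / 4
constexpr int kMaxResidentWarpsCC86 = 48;      // documented limit for CC 8.6

// Hypothetical helper: total warps an SM could hold if every scheduler
// really managed `warps_per_sched` warps.
constexpr int assumed_capacity(int schedulers, int warps_per_sched) {
    return schedulers * warps_per_sched;
}

// Hypothetical helper: warps each scheduler gets if the resident warps
// are divided evenly among the schedulers (an assumption on my part).
constexpr int warps_per_scheduler(int resident_warps, int schedulers) {
    return resident_warps / schedulers;
}

// Compile-time checks of the numbers in the question.
static_assert(assumed_capacity(kSchedulersPerSM, kAssumedWarpsPerScheduler) == 64,
              "4 schedulers x 16 warps gives the 64 I expected");
static_assert(warps_per_scheduler(kMaxResidentWarpsCC86, kSchedulersPerSM) == 12,
              "48 resident warps over 4 schedulers gives 12 each");
```

So under the even-split assumption, the 48-warp limit leaves each scheduler with 12 warps rather than the 16 I inferred from the quoted sentence, and that gap of 16 warps per SM is what I am asking about.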