How do the warps schedule on CUDA SMs?

Question

As the answer to this question shows, when an SM contains 8 CUDA cores (Compute Capability 1.3), a single warp of 32 threads takes 4 clock cycles to execute a single instruction for the whole warp.

That is, lanes 1 to 8 of the warp run concurrently on the 8 cores, then lanes 9 to 16, after that lanes 17 to 24, and finally lanes 25 to 32.

Do I understand this correctly?

So my question is: on newer devices there are 32 (Compute Capability 2.0), 48 (2.1), or 192 (3.0, Kepler) CUDA cores per SM, but the warp size is still 32.

How are warps scheduled on these newer SMs?
Do lanes 1 to 32 run together, or in groups (lanes 1 to 8, lanes 9 to 16, ...) as described above for the old CUDA SMs?

It takes 4 cycles to issue the instruction. The pipeline depth of the math pipe is much longer than 4 cycles. — Greg Smith, May 08 '14 at 05:24

score 15 · Answer 1 · answered May 08 '14 at 05:49

The number of CUDA cores is the number of single-precision floating-point units in the SM. The SM has other execution units, including special function units (RSQRT, COS, SIN, ...), double-precision units, load/store units, texture units, the branch unit, etc.

The Fermi, Kepler-gk10x, Kepler-gk110 and Maxwell whitepapers contain additional information on the type and number of execution units in the SMs.

The instruction throughput of Arithmetic Instructions can be found in the CUDA Programming Guide in the Table of Throughput of Arithmetic Instructions.

As a developer, you want to understand the rate at which an SM can issue instructions, which is documented in the throughput table. The rate is determined by the throughput of the warp schedulers as well as the throughput of the execution units (again, not just the CUDA cores).

CC1.x Tesla

1 warp scheduler per SM
Each warp scheduler selects 1 eligible warp and issues 1 instruction per 4 cycles.
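The 4-cycle figure lines up with the arithmetic in the question: 32 threads per warp spread over 8 CUDA cores. Here is a minimal sketch of that arithmetic, assuming a fixed number of lanes is issued each cycle. The helper name `issue_cycles` is hypothetical, and this models issue time only, not the depth of the math pipeline.

```cpp
#include <cassert>

// Hypothetical helper: cycles needed to issue one warp-wide instruction
// when `lanes_per_cycle` threads of the warp are issued each cycle.
// Rounds up so a partially filled last group still costs a full cycle.
constexpr int issue_cycles(int warp_size, int lanes_per_cycle) {
    return (warp_size + lanes_per_cycle - 1) / lanes_per_cycle;
}

// CC1.x: 32 threads per warp, 8 CUDA cores per SM.
static_assert(issue_cycles(32, 8) == 4, "a warp takes 4 cycles to issue on CC1.x");
```

The `static_assert` is checked at compile time, so the snippet documents the expected value without needing to be run.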

CC2.x Fermi

2 warp schedulers per SM
CC2.0 Each warp scheduler selects 1 eligible warp per tepid clock and issues 1 instruction.
CC2.1 Each warp scheduler selects 1 eligible warp per tepid clock and issues up to 2 independent instructions.
The math pipes run at hot clock (2x tepid clock). This often results in people stating that instructions are issued over 2 clock cycles. It is easier to think in terms of tepid clock.
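The two clock domains can be related with a one-line conversion. This is a sketch that assumes only the 2x ratio stated above; the names `kHotPerTepid` and `tepid_to_hot` are hypothetical.

```cpp
#include <cassert>

// Assumption taken from the text: the hot clock runs at 2x the tepid clock.
constexpr int kHotPerTepid = 2;

// Hypothetical helper: convert a duration in tepid clocks to hot clocks.
constexpr int tepid_to_hot(int tepid_cycles) {
    return tepid_cycles * kHotPerTepid;
}

// One issue slot per tepid clock spans two hot clocks, which is why the same
// event is sometimes described as "issued over 2 clock cycles".
static_assert(tepid_to_hot(1) == 2, "1 tepid clock equals 2 hot clocks");
```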

CC3.x Kepler, CC5.0 Maxwell

4 warp schedulers per SM
Each warp scheduler selects 1 eligible warp and issues up to 2 independent instructions.
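The scheduler figures listed per architecture can be combined into a rough upper bound on issue rate per SM. The sketch below is a simplified model, not a measurement. The struct, constant, and function names are hypothetical, and it assumes Kepler and Maxwell schedulers issue once per cycle. For Fermi, "cycle" means tepid clock.

```cpp
#include <cassert>

// Hypothetical model of the per-SM scheduler figures listed above.
struct SmIssueModel {
    int schedulers;           // warp schedulers per SM
    int max_instr_per_issue;  // instructions one scheduler can issue per selected warp
    int cycles_per_issue;     // cycles between issues from one scheduler
};

constexpr SmIssueModel kCC1x{1, 1, 4};      // Tesla: 1 instruction per 4 cycles
constexpr SmIssueModel kCC20{2, 1, 1};      // Fermi CC2.0, per tepid clock
constexpr SmIssueModel kCC21{2, 2, 1};      // Fermi CC2.1, per tepid clock
constexpr SmIssueModel kCC3xOr50{4, 2, 1};  // Kepler / Maxwell (assumed per cycle)

// Upper bound on thread-level instructions issued per cycle per SM.
// Reaching it requires enough eligible warps and independent instructions.
constexpr int peak_thread_instr_per_cycle(SmIssueModel m, int warp_size = 32) {
    return m.schedulers * m.max_instr_per_issue * warp_size / m.cycles_per_issue;
}

static_assert(peak_thread_instr_per_cycle(kCC1x) == 8, "CC1.x");
static_assert(peak_thread_instr_per_cycle(kCC20) == 64, "CC2.0, per tepid clock");
static_assert(peak_thread_instr_per_cycle(kCC21) == 128, "CC2.1, per tepid clock");
static_assert(peak_thread_instr_per_cycle(kCC3xOr50) == 256, "CC3.x / CC5.0");
```

Under this model the CC1.x bound of 8 matches its 8 CUDA cores, and the CC2.0 bound of 64 per tepid clock is 32 per hot clock, matching its 32 cores. The bound for Kepler (256) exceeds its 192 CUDA cores, which is consistent with the point that the schedulers also feed execution units other than the CUDA cores.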

Thanks for adding a canonical answer to this perennial question. I would encourage everyone to upvote it to make it visible in search. — talonmies, May 08 '14 at 05:58
