
I frequently found the following words in some CUDA materials:

"At any time, only one of the warps is executed by a SM".

I don't quite understand this: since each SM can run hundreds to thousands of threads concurrently, why can only a single warp of 32 threads be executed at any point in time?

Thanks!

Hailiang Zhang

2 Answers


Details vary for different generations of CUDA hardware, but in earlier generations, for example, each SM has 8 execution units, each of which executes 4 threads (one instruction from each thread every 4 cycles). Hence you get 4-way SMT, which gives 32 concurrent threads per SM.

Of course there are multiple SMs per GPU, e.g. 30, which would mean 30 x 32-thread warps = 960 threads executing at any given instant. On top of this, warps can be switched in and out, so you can have many more than 960 "live" threads, even though only 960 of them are actually executing at any given time.

Paul R
    The granularity of warp execution looks slightly different for different generations of GPUs (GT200, Fermi, Kepler). This description has GT200 in view. In Fermi, each SM has 32 execution units, and so one warp is executed simultaneously. In Kepler, each SM has more than 32 execution units, and so multiple warps can be executed simultaneously ("at any given instant/time"), per SMX (per Kepler SM). But in all cases the definition of a warp is 32 threads executing in lockstep. – Robert Crovella Nov 19 '12 at 22:48
  • So if a block has more than 32 threads, it has to be sequentially loaded on the same SM every 32 threads, right? – Hailiang Zhang Nov 20 '12 at 02:38
  • 1
    A block that has been scheduled to execute is always resident on one and only one SM. Once a block begins to execute on an SM, it remains there. All threads of that block (grouped into warps) will execute on that SM, until the block is finished and retired. Since an SM has a limited number of execution units, it may be the case that warps may execute sequentially, or in some random order. When a warp stalls for any reason (e.g. a memory reference) the SM is free to schedule another available warp, from among the blocks that are resident on that SM. Also see Greg Smith's answer. – Robert Crovella Nov 20 '12 at 04:08
  • Thanks! But now if different warps from the same block are loaded sequentially (or in a random order), how could `__syncthreads()` synchronize those "sequential" warps? – Hailiang Zhang Nov 20 '12 at 05:20
  • 2
    `__syncthreads()` is a barrier. Once any warp hits that barrier, that warp is stalled and will be moved out of execution, and another warp will be selected by the warp scheduler to take its place. If that new warp is from the same threadblock, then presumably it too will, at some point, hit the `__syncthreads()` barrier, and will be swapped out for another warp. Following this process, eventually all warps in the threadblock will reach (and stall at) the barrier. Once all warps have stalled at the barrier, any of the warps can then proceed beyond the barrier. – Robert Crovella Nov 20 '12 at 14:19
  • I think Paul means 8 cores each having 4 lanes (threads) and not execution units (which can be confused with ALUs). – G Gill Apr 27 '15 at 01:02
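To make the barrier discussion in the comments concrete, here is a minimal CUDA kernel sketch. The shared-memory tile reversal is a hypothetical example (not from this thread); the point is that every warp of the block must reach `__syncthreads()` before any warp proceeds past it:

```cuda
__global__ void reverse_block(int *data)
{
    __shared__ int tile[256];          // assumes a 256-thread block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = data[i];                 // each warp fills its slice of the tile

    // Barrier: a warp arriving here stalls and is swapped out by the warp
    // scheduler until every warp in the block has arrived, as described above.
    __syncthreads();

    // Only now is it safe to read elements written by other warps.
    data[i] = tile[blockDim.x - 1 - t];
}
```

Without the barrier, a warp could read `tile` entries that another, not-yet-executed warp has not written.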

The statement is true of the Tesla architecture but it is incorrect for Fermi and Kepler. It is easier to look at the SM in terms of warp schedulers. On each cycle the warp scheduler selects an eligible warp (a warp that is not stalled) and dispatches one or two instructions from the warp to execution units. The number of execution units per SM is documented in the Fermi and Kepler whitepapers. CUDA cores roughly equate to the number of execution units that can perform integer and single precision floating point operations. There are additional execution units for load/store operations, branching, etc.

Compute Capability 1.x (Tesla)

  • 1 warp scheduler per SM
  • Dispatch 1 instruction per warp scheduler

Compute Capability 2.0 (Fermi 1st Generation)

  • 2 warp schedulers per SM
  • Dispatch 1 instruction per warp scheduler

Compute Capability 2.1 (Fermi 2nd Generation)

  • 2 warp schedulers per SM
  • Dispatch 1 or 2 instructions per warp scheduler

Compute Capability 3.x (Kepler)

  • 4 warp schedulers per SM
  • Dispatch 1 or 2 instructions per warp scheduler
Greg Smith
  • As all the threads in a warp execute a single instruction at once, how could it be possible that a warp scheduler dispatches 2 instructions? – haccks Jun 28 '15 at 10:27
  • As long as the two instructions are independent and use different execution units then the warp scheduler can dispatch 2 instructions to the same warp. This is very common. Intel i7 can issue 5-7 instructions per cycle per core depending on the generation. For more information see this link on [Superscalar](https://en.wikipedia.org/wiki/Superscalar) CPU architectures. – Greg Smith Jun 30 '15 at 04:34
  • But all threads in a warp execute a single instruction at a time, so dispatching two independent instructions to the same warp doesn't make any sense. – haccks Jun 30 '15 at 09:30
  • Instructions are scheduled at a warp level on a per cycle basis. SM2.1 and above warp schedulers can dispatch up to 2 independent instructions for the selected warp per cycle. This is stated in the Kepler and Maxwell white papers and the instruction pairing for Kepler and Maxwell should be shown in the CUDA 7.5 profilers. – Greg Smith Jul 02 '15 at 00:37