
Note: This question is specific to NVIDIA Compute Capability 2.1 devices. The following information is taken from the CUDA Programming Guide v4.1:

In compute capability 2.1 devices, each SM has 48 SPs (cores) for integer and floating-point operations. Each warp is composed of 32 consecutive threads. Each SM has 2 warp schedulers. At every instruction issue time, a warp scheduler picks a ready warp and issues 2 instructions for that warp to the cores.
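For reference, these numbers can be checked at runtime with the CUDA runtime API. A minimal sketch (device index 0 assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("Warp size: %d\n", prop.warpSize);
        return 0;
    }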

My doubts:

  • One thread will execute on one core. How can the device issue 2 instructions to a thread in a single clock cycle or a single multi-cycle operation?
  • Does this mean the 2 instructions should be independent of each other?
  • Does it mean that the 2 instructions can be executed in parallel on the core, perhaps because they use different execution units in the core? And does this mean the warp will be ready again only after both instructions have finished executing, or after one of them has?
  • It is classic instruction level parallelism. Remember a warp retires an instruction on 16 cores over a minimum of 2 cycles on Fermi. Compute 2.1 hardware has a "spare" 16 cores per SM that can handle a second instruction, if available, from either of the 2 concurrent warps per SM. If ILP is not possible, the instruction issue rate becomes a theoretical maximum of 2 instructions per 4 cycles, rather than 1 instruction per 2 cycles, as on compute 2.0 devices. – talonmies Mar 27 '12 at 08:56
  • Talonmies: Thanks for the explanation. Could you elaborate on the retiring and the theoretical maximum? Please add as an answer so I can accept it and others can edit it. – Ashwin Nanjappa Mar 27 '12 at 09:29
  • I didn't and won't add it as an answer because I don't know anything about the "computer architecture perspective" you are asking about. The theoretical maximum comes simply from how many cycles an instruction takes to retire. A single precision FMAD can retire in one cycle, but others may be/are slower than that. A 1 cycle-to-retire instruction on 16 cores takes 2 cycles to retire for a warp of 32 threads. That is the theoretical maximum instruction throughput. – talonmies Mar 27 '12 at 09:41
    Be careful. The latency of an FMAD or any other instruction is much longer than 1 cycle. You can't invert throughput to get latency. It's 1 instruction per cycle, not 1 cycle per instruction. – harrism Mar 27 '12 at 09:59
  • I answered it -- I don't like to leave CUDA tag questions unanswered. – harrism Mar 27 '12 at 10:04
  • @harrism: yeah that wasn't very well worded. There is pipelining to worry about in the latency of a given instruction - I was thinking in terms only of hardware instruction throughput/retirement rate. The perils of taking hardware advice from an idiot, I guess. – talonmies Mar 27 '12 at 10:11

1 Answer


This is instruction-level parallelism (ILP). The instructions issued simultaneously from a warp must be independent of each other. They are issued by the SM's instruction scheduler to separate functional units in the SM.

For example, if there are two independent FMAD instructions in the warp's instruction stream that are ready to issue and the SM has two available sets of FMAD units on which to issue them, they can both be issued in the same cycle. (Instructions can be issued together in various combinations, but I have not memorized them so I won't provide details here.)
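To illustrate, here is a hypothetical kernel sketch (names are illustrative, and whether dual issue actually occurs depends on the machine code the compiler emits). The two multiply-add chains below are independent of each other, so they are candidates for being issued together:

    __global__ void ilp2(const float *x, float *out, float a, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            float p = a * v + b;  // FMAD chain 1
            float q = b * v + a;  // FMAD chain 2, independent of chain 1
            out[i] = p + q;       // the two chains join only here
        }
    }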

The FMAD/IMAD execution units in SM 2.1 are 16 SPs wide. This means that it takes 2 cycles to issue a warp (32-thread) instruction to one of the 16-wide execution units. There are multiple (3) of these 16-wide execution units (48 SPs total) per SM, plus special function units. Each warp scheduler can issue to two of them per cycle.

Assume the FMAD execution units are pipe_A, pipe_B and pipe_C. Let us say that at cycle 135, there are two independent FMAD instructions fmad_1 and fmad_2 that are waiting:

  • At cycle 135, the instruction scheduler will issue the first half warp (16 threads) of fmad_1 to FMAD pipe_A, and the first half warp of fmad_2 to FMAD pipe_B.
  • At cycle 136, the first half warp of fmad_1 will have moved to the next stage in FMAD pipe_A, and similarly the first half warp of fmad_2 will have moved to the next stage in FMAD pipe_B. The warp scheduler now issues the second half warp of fmad_1 to FMAD pipe_A, and the second half warp of fmad_2 to FMAD pipe_B.

So it takes 2 cycles to issue 2 instructions from the same warp. But as the OP mentions, there are two warp schedulers, which means this whole process can be carried out simultaneously for instructions from another warp (assuming there are sufficient functional units). Hence the maximum issue rate is 2 warp instructions per cycle. Note that this is an abstracted view from a programmer's perspective; the actual low-level architectural details may be different.
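To put rough numbers on this, a back-of-the-envelope sketch (the 1.0 GHz core clock is an assumption; actual clocks vary by GPU model):

    #include <cstdio>

    int main() {
        const int cores_per_sm  = 48;   // 3 groups of 16 SPs (CC 2.1)
        const int flops_per_fma = 2;    // one multiply + one add
        const double clock_ghz  = 1.0;  // assumed core clock
        // Peak per-SM throughput if every core retires one FMA per cycle:
        double gflops = cores_per_sm * flops_per_fma * clock_ghz;
        printf("Peak per-SM FMA throughput: %.0f GFLOP/s at %.1f GHz\n",
               gflops, clock_ghz);
        return 0;
    }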

As for your question about when the warp will be ready next: if there are more instructions that don't depend on any outstanding (already issued but not retired) instructions, then they can be issued in the very next cycle. But as soon as the only available instructions depend on in-flight instructions, the warp will not be able to issue. That is where other warps come in -- the SM can issue instructions for any resident warp that has available (non-blocked) instructions. This arbitrary switching between warps is what provides the "latency hiding" that GPUs depend on for high throughput.
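In practice, one way to expose ILP from within a single thread is manual unrolling with independent accumulators. A sketch (names and the 2-way unroll factor are illustrative; tail handling and the final cross-thread reduction are omitted):

    __global__ void dot2(const float *x, const float *y, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        float acc0 = 0.0f, acc1 = 0.0f;  // two independent FMA chains
        for (int j = i; j + stride < n; j += 2 * stride) {
            acc0 += x[j] * y[j];                    // chain A
            acc1 += x[j + stride] * y[j + stride];  // chain B
        }
        out[i] = acc0 + acc1;  // out is assumed to have one slot per thread
    }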

harrism
  • Would it be appropriate to also include something about the cores running at double the frequency of the schedulers in this answer? – Roger Dahl Mar 28 '12 at 02:19
  • The number of CUDA cores and the frequency of the execution units is not relevant to the answer. – Greg Smith Mar 28 '12 at 02:56
  • Harrism: The CUDA Programming Guide says that the 2.0 warp scheduler issues an instruction to a warp as follows: first half-warp in one cycle and second half-warp in the next cycle. This in itself is a bit confusing, but gets more confusing when I compare it with the 2.1 scheduler, which issues 2 instructions per warp. Could you flesh out the details a bit in the answer? – Ashwin Nanjappa Mar 28 '12 at 04:08
  • Harrism: Thanks a lot! One of the best CUDA answers on StackOverflow. Accepted :-) – Ashwin Nanjappa Mar 29 '12 at 04:23