
Let's take the NVIDIA Fermi Compute Architecture whitepaper. It says:

The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each.

[...]

Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU).

[...]

In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations.

From what I know, GPUs execute threads in so-called warps, each warp consisting of ~32 threads — and this is what is unclear to me. Each warp is assigned to only one core (is that true?). So does that mean that each of the 32 cores of a single SM is a SIMD processor, where a single instruction handles 32 portions of data? If so, why do we say there are 32 threads in a warp, rather than a single SIMD thread? And why are cores sometimes referred to as scalar processors rather than vector processors?

Marc Andreson

2 Answers


Each warp is assigned to only one core (is that true?).

No, it's not true. A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).

Cores are in fact scalar processors, not vector processors. 32 cores (or execution units) are marshalled by the warp scheduler to execute a single instruction, across 32 threads, which is where the "SIMT" moniker comes from.
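The SIMT model described here can be pictured with a toy sketch (illustrative only — the `Warp` class and `step()` method below are made up for this answer, not a real API): one instruction is issued per step, and all 32 lanes execute it on their own data.

```python
# Toy model of SIMT execution. One instruction is issued per step,
# and every lane (thread) of the warp applies it to its own data --
# "single instruction, multiple threads".

WARP_SIZE = 32

class Warp:
    def __init__(self):
        # one register value per thread (lane) of the warp
        self.regs = [0] * WARP_SIZE

    def step(self, op):
        # the scheduler issues a single instruction; all lanes
        # execute it in lockstep on their own register values
        self.regs = [op(lane, r) for lane, r in enumerate(self.regs)]

w = Warp()
w.step(lambda lane, r: lane)   # each thread loads its lane (thread) id
w.step(lambda lane, r: r * 2)  # same instruction, per-lane data
print(w.regs[:4])  # [0, 2, 4, 6]
```

Note that each lane here is scalar — it operates on one value per instruction — which is why the individual cores are called scalar processors even though the warp as a whole behaves much like a vector unit.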

Robert Crovella
  • So a single SM of 32 cores at one time can execute only one warp of 32 threads? If so, why do I often hear that a SM can *"handle up to 8 blocks at a time"* ? – Marc Andreson Feb 02 '15 at 18:34
  • 10
    I suppose this question could devolve into a complete tutorial on the mechanics of GPU execution. I'd encourage you to avail yourself of existing resources, as this is covering ground that has been covered many times, elsewhere. The warp schedulers in an SM can, at each instruction cycle, select a new instruction from *any available warp*. The available warps may come from any of the threadblocks currently resident on the SM. Instructions *need not come from the same warp or threadblock*, from one cycle to the next. The GPU is a latency-hiding machine and likes to have many available warps. – Robert Crovella Feb 02 '15 at 18:43
  • my question arose from the [video presentation](https://www.youtube.com/watch?v=KfGnLltyRH4&t=34m52s), which presents lots of threads in a single SM. So basically, GPU cores are scalar processors, a single warp spreads along entire SM, SM has several blocks assigned and execute warps sequentially, and the graphics from the presentation meant that the green dots are threads that are *active* but not necessarily *executed* at the same time? – Marc Andreson Feb 02 '15 at 18:51
  • 5
    The green dots in the slide represent hardware resources (execution units), not threads. The picture specifically indicates "GK110 Block Diagram". That is a hardware chip, not a collection of threads. A single warp, when one of its instructions is selected for execution, "spreads" across 32 execution units, for that clock-cycle. On the very next clock cycle, a different warp could be scheduled on those same execution units. SMs execute warps in some undefined order - based on the specific sequence of selection of instructions. – Robert Crovella Feb 02 '15 at 19:01
  • but "execution unit = core = one thread", isn't it? There are ~192 green dots in what the presenter calls *a single SM*, while GK110 is said to have 32 cores per SM. I don't get what is the difference :-( – Marc Andreson Feb 02 '15 at 19:08
  • the presenter says: *"each of the **threads** map to these **individual** green dots"* – Marc Andreson Feb 02 '15 at 19:16
  • 6
    GK110 does not have 32 cores per SM. That is not stated anywhere. (Perhaps you are confusing *Fermi* -- your initial whitepaper link -- with *Kepler* -- what is being discussed in the video.) GK110 has 192 cores per SM. Which is like saying there are 192 green dots in the single SM in the picture. "execution unit = core" YES. "core = one thread" NO. Does a CPU core equal a CPU thread? It does not. One is a hardware resource. The other is a logical collection of instructions. The threads (SW) *map* to green dots (HW) *when the warp scheduler places an instruction there*. – Robert Crovella Feb 02 '15 at 19:43
  • So this is what I misunderstood - each single green dot on the image is a *separate* core. Thanks. Just to clarify one thing: cores are neither vector processors nor super-scalar processors, so by saying *execution unit* you mean processing unit, not its internal subunits like ALUs, etc., one core=one instruction handled at a time in that unit ? – Marc Andreson Feb 02 '15 at 19:49
  • 4
    When I say execution unit I mean, for the purposes of this discussion, a core. The thing that a GK110 SM has 192 of is actually an execution unit that can retire 1 single precision floating point operation per clock (2 for fused multiply-add). There are other types of execution units -- in varying quantities per SM -- that are used for performing other types of operations, such as double precision floating point, transcendental functions, load/store operations, etc. The execution of an instruction sequence will typically not be carried out entirely by a single core or execution unit (type). – Robert Crovella Feb 02 '15 at 19:55
  • SM execution is arguably [super-scalar](http://en.wikipedia.org/wiki/Superscalar), and execution units are pipelined, which is a [separate discussion](http://stackoverflow.com/questions/28032470/are-gpu-kepler-cc3-0-processors-not-only-pipelined-architecture-but-also-supers/28032697#28032697). – Robert Crovella Feb 02 '15 at 20:06
  • In processor architecture, super-scalar and pipelining are orthogonal concepts. To pick some ancient examples, a 486 processor was pipelined, a Pentium processor pipelined and super-scalar. – njuffa Feb 02 '15 at 20:06
  • @njuffa, I've been editing my comment a few times now to get the semantics just right. The execution units in an SM are indisputably (in my opinion) pipelined. An individual execution unit is not superscalar. The Kepler SM as a whole, or the warp scheduler combined with various execution units within the SM, might be called superscalar, using [the wikipedia definition](http://en.wikipedia.org/wiki/Superscalar): "A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different functional units on the processor." – Robert Crovella Feb 02 '15 at 20:11
  • 4
    I believe a Kepler warp scheduler can execute 2 instructions from the same thread, in the same cycle, subject to certain limitations. Referring to [the GK110 whitepaper](http://www.nvidia.com/content/PDF/kepler/NVIDIA-kepler-GK110-Architecture-Whitepaper.pdf), "Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle" – Robert Crovella Feb 02 '15 at 21:02
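The latency-hiding behavior described in the comments above can be sketched as a toy scheduler (a hedged Python model; the `schedule` function and its inputs are made up for illustration, not real hardware behavior): each cycle, the scheduler issues from any warp that is not stalled, so a warp waiting on memory does not leave the execution units idle.

```python
# Toy model of latency hiding: each cycle the warp scheduler issues
# an instruction from any warp that is ready. While one warp stalls
# (e.g. on a memory load), other warps keep the hardware busy.

def schedule(warps, cycles):
    """warps: dict of warp name -> list of instruction latencies."""
    ready_at = {name: 0 for name in warps}  # cycle when warp may issue again
    pc = {name: 0 for name in warps}        # next instruction index per warp
    issued = []
    for cycle in range(cycles):
        for name in warps:  # pick the first warp that is ready
            if ready_at[name] <= cycle and pc[name] < len(warps[name]):
                issued.append((cycle, name))
                ready_at[name] = cycle + warps[name][pc[name]]
                pc[name] += 1
                break
    return issued

# warp A stalls 4 cycles on its first instruction; warps B and C fill the gap
trace = schedule({"A": [4, 1], "B": [1, 1], "C": [1, 1]}, 6)
print(trace)  # [(0, 'A'), (1, 'B'), (2, 'B'), (3, 'C'), (4, 'A'), (5, 'C')]
```

Every cycle issues an instruction from *some* warp, which is the sense in which the GPU "likes to have many available warps".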

CUDA "cores" can be thought of as SIMD lanes.

First, let's recall that the term "CUDA core" is NVIDIA marketing-speak. These are not cores in the same way a CPU has cores. Similarly, "CUDA threads" are not the same as the threads we know from CPUs.

The equivalent of a CPU core on a GPU is a streaming multiprocessor (SM): it has its own instruction scheduler/dispatcher, its own L1 cache, its own shared memory, etc. It is CUDA thread blocks, rather than warps, that are assigned to a GPU core, i.e. to a streaming multiprocessor. Within an SM, warps get selected to have instructions scheduled for the entire warp. From a CUDA perspective, those are 32 separate threads which are instruction-locked; but that's really no different from saying that a warp is like a single thread which only executes 32-lane-wide SIMD instructions. Of course this isn't a perfect analogy, but I feel it's pretty sound. Something you don't quite / don't always have with CPU SIMD lanes is a mask of which lanes are actively executing: inactive lanes will not have the effect of active lanes' setting of register values, memory writes, etc.
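That lane-masking behavior can be sketched as follows (a toy Python model of branch divergence, not the real hardware mechanism; the small warp size and variable names are made up for readability):

```python
# Toy model of per-lane masking under branch divergence: the warp
# executes both sides of a branch serially, and a mask decides which
# lanes actually take effect on each side. Masked-off lanes leave
# their registers untouched.

WARP_SIZE = 8  # small for readability; real warps have 32 lanes

regs = list(range(WARP_SIZE))

# branch condition: if (lane % 2 == 0) ...
mask = [lane % 2 == 0 for lane in range(WARP_SIZE)]

# "if" path: only lanes where the mask is True take effect
regs = [r + 100 if m else r for r, m in zip(regs, mask)]

# "else" path: mask inverted; the other lanes execute now
regs = [r - 1 if not m else r for r, m in zip(regs, mask)]

print(regs)  # [100, 0, 102, 2, 104, 4, 106, 6]
```

Both paths occupy the warp's issue slots, which is why divergent branches within a warp cost performance even though only some lanes are active at a time.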

I hope this helps you make intuitive sense of things.

einpoklum
  • What does "instruction-locked" mean here? If there are 16 cores in the SM and the warp scheduler can only issue 1 instruction per cycle, then the warp scheduler needs 2 cycles to issue the 32 threads in the warp — could the warp scheduler switch to another warp between these 2 cycles? – Thomson Sep 20 '19 at 19:10
  • 2
    An SM is a single core, physically. "CUDA cores" is just a marketing term. But regardless: The execution will proceed as-if all 32 threads executed simultaneously, i.e. the program will never perceive a state in which half of the warp has concluded one instruction and the other half has concluded the next one. – einpoklum Sep 20 '19 at 21:31