
I read the NVIDIA Fermi whitepaper and got confused when I worked through the numbers of SP cores and schedulers.

According to the whitepaper, in each SM there are two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. There are 32 SP cores in an SM, and each core has a fully pipelined ALU and FPU that executes the instruction of one thread.

As we all know, a warp is made up of 32 threads. If we issued one warp per cycle, all threads in that warp would occupy all 32 SP cores and finish execution in one cycle (assuming there are no stalls).
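
For concreteness, here is the minimal picture I have in mind (a hypothetical kernel, just for illustration): one block of exactly 32 threads, i.e. a single warp, where every thread executes the same arithmetic instruction on its own data.

```
#include <cstdio>

// One block of 32 threads = exactly one warp.
// Every thread executes the same instruction on its own lane's data.
__global__ void oneWarpFma(float *out, float a, float b)
{
    int lane = threadIdx.x;       // 0..31 within the single warp
    out[lane] = a * lane + b;     // same instruction stream, different data
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    oneWarpFma<<<1, 32>>>(d_out, 2.0f, 1.0f);   // exactly one warp per block
    cudaDeviceSynchronize();

    float h_out[32];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 31 -> %f\n", h_out[31]);
    cudaFree(d_out);
    return 0;
}
```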

However, NVIDIA designed a dual scheduler, which selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.

NVIDIA says this design leads to near peak hardware performance. Maybe the peak performance comes from interleaving the execution of different instructions, taking full advantage of the hardware resources.

My questions are as follows (assume no memory stalls and all operands are available):

  1. Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups of sixteen, one per warp scheduler?

  2. Are the LD/ST and SFU units shared by all warps (i.e., available uniformly to warps from both schedulers)?

  3. If a warp is divided into two halves, which half is scheduled first? Is there a scheduler for this, or is one half just selected at random?

  4. What is the advantage of this design? Just to maximize the utilization of the hardware?

Dongwei Wang

1 Answer


Does each warp need two cycles to finish execution, and are the 32 SP cores divided into two groups of sixteen, one per warp scheduler?

Yes. Fermi, unlike later generations, has a "hotclock" (shader clock) which runs at 2x the "core" clock. Each single-precision floating-point instruction (for example) issues over 2 "hotclocks", but to the same group of 16 SP cores. The net effect is one issue per "core" clock per scheduler.
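
As a rough sketch of the arithmetic (using only the figures quoted from the whitepaper above, nothing measured):

```
// Back-of-the-envelope issue-rate check for one Fermi SM,
// using the whitepaper's figures (2 schedulers, 16 SP cores per
// scheduler group, hotclock = 2x core clock). Host-only arithmetic.
#include <cstdio>

int main()
{
    const int schedulers        = 2;   // dual warp schedulers per SM
    const int cores_per_group   = 16;  // SP cores fed by one scheduler
    const int hotclocks_per_clk = 2;   // shader clock runs at 2x the core clock
    const int warp_size         = 32;

    // Each scheduler issues one warp instruction per core clock; its
    // 16-core group retires it over 2 hotclocks (16 lanes x 2 = 32 threads).
    int lanes_per_core_clock = schedulers * cores_per_group * hotclocks_per_clk;
    printf("%d thread-instructions per core clock = %d warp instructions\n",
           lanes_per_core_clock, lanes_per_core_clock / warp_size);
    return 0;
}
```

which comes out to 64 thread-instructions, i.e. two full warp instructions, per core clock per SM.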

Are the LD/ST and SFU units shared by all warps (i.e., available uniformly to warps from both schedulers)?

Don't really understand the question. All execution resources are shared/available for instructions coming from either scheduler.

If a warp is divided into two halves, which half is scheduled first? Is there a scheduler for this, or is one half just selected at random?

Why does this matter? The machine behaves as if two complete warp instructions are scheduled in one core clock, i.e. "dual issue". You don't have visibility into anything happening at the hotclock level anyway.

What is the advantage of this design? Just to maximize the utilization of the hardware?

Yes, as stated in the Fermi whitepaper:

" Using this elegant model of dual-issue, Fermi achieves near peak hardware performance. "

Robert Crovella
  • For Q1, my understanding is: the scheduler selects warps at the SM (core) clock, but the SP cores run 2x faster, so two warps can finish their execution in two hotclock cycles. But do only the SP cores run 2x faster than the SM clock? What about the other functional units, such as LD/ST and the L1 cache (data cache, shared memory, constant memory and texture memory)? – Dongwei Wang May 05 '16 at 01:39
  • All schedulable resources run at the hotclock, because *all instruction issue* happens on a half-warp basis at the hotclock. – Robert Crovella May 05 '16 at 01:45
  • For Q2, the Fermi whitepaper says "Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs." (page 10). That means all resources are divided into groups, so my intuition is that 16 SP cores, as a group, process one warp, and the 16 LD/ST units, as a group, process memory requests. Suppose each warp has 8 memory requests; then the 16 LD/ST units can only process these two warps one after the other in two cycles, not together in one cycle. – Dongwei Wang May 05 '16 at 01:45
  • "Suppose each warp has 8 memory requests" is not possible. Any given instruction in the thread stream is issued on a warp-wide basis. If a given thread is issuing an LD (read from memory) instruction in a given (core) clock cycle, then **every thread in the warp** is also issuing an LD instruction in that (core) clock cycle. Or if you prefer, the same statement can be made at the half-warp/hotclock level: if a given thread is executing an LD instruction at a particular hotclock cycle, then **every thread in that half-warp** is also executing an LD instruction in that cycle. – Robert Crovella May 05 '16 at 01:49
  • For Q3, I think it matters. The "hotclock" is invisible, but it relates to warp scheduling. We know a warp executes 32 threads, but these 32 threads cannot all behave exactly the same (if they did, it would not matter). For example, memory request instructions involve memory access latency; if we can schedule them in a memory-friendly fashion, maybe we can hide some of that latency. Do you agree? – Dongwei Wang May 05 '16 at 01:50
  • All 32 threads in a warp behave **"totally the same"** in any given core clock cycle. That core clock cycle is composed of 2 hotclock cycles, during which each half-warp issues the instruction associated with that issue slot. – Robert Crovella May 05 '16 at 01:52
  • Regarding the reply to Q1, I am clear now: all SP cores and LD/ST units run at the hotclock, while the warp scheduler runs at the SM clock. – Dongwei Wang May 05 '16 at 12:53
  • For Q2, I was wrong: all the threads in a warp execute the same instruction, so all of them send memory requests if it is an LD/ST instruction, and therefore all the LD/ST units will be occupied. – Dongwei Wang May 05 '16 at 12:53
  • Regarding the reply to Q3, maybe I did not make things clear. What I mean is: all threads in a warp behave totally the same, and an instruction will finish execution in one SM cycle if it is an int/float operation (with all operands available). But what if it is an LD/ST instruction? The threads will not all get their responses at the same time, so they may have different stall times. That is why I believe the scheduling order matters. – Dongwei Wang May 05 '16 at 12:53