I read the NVIDIA Fermi whitepaper and got confused when I tried to work out how the number of SP cores relates to the number of schedulers.
According to the whitepaper, each SM has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. There are 32 SP cores per SM, and each core has a fully pipelined ALU and FPU that executes one instruction for one thread.
As we all know, a warp is made up of 32 threads. If we issued one warp per cycle, all 32 threads of that warp would occupy all 32 SP cores, and the instruction would finish in one cycle (assuming there are no stalls).
However, NVIDIA devised a dual scheduler, which selects two warps and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.
NVIDIA says this design leads to peak hardware performance. Maybe the peak performance comes from interleaving the execution of instructions from different warps, taking full advantage of the hardware resources.
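If my reading is right, the arithmetic works out like this (a back-of-the-envelope Python sketch; the counts are from the whitepaper, the variable names are my own):

```python
# Back-of-the-envelope issue-rate arithmetic for one Fermi SM.
# The numbers come from the whitepaper; the naming is mine.

WARP_SIZE = 32          # threads per warp
CORES_PER_SM = 32       # SP (CUDA) cores per SM
SCHEDULERS = 2          # warp schedulers per SM
CORES_PER_GROUP = CORES_PER_SM // SCHEDULERS   # 16 cores fed by each scheduler

# Cycles for one warp instruction to pass through a 16-core group:
cycles_per_warp_instr = WARP_SIZE // CORES_PER_GROUP   # 32 / 16 = 2

# Per cycle, each scheduler feeds 16 threads, so the two schedulers
# together keep all 32 cores busy:
threads_issued_per_cycle = SCHEDULERS * CORES_PER_GROUP

print(cycles_per_warp_instr)      # 2
print(threads_issued_per_cycle)   # 32
```

So each warp instruction seems to take two cycles through its 16-core group, but the two schedulers together still saturate all 32 cores every cycle.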
My questions are as follows (assume no memory stalls and all operands are available):
1. Does each warp need two cycles to finish executing an instruction, and are the 32 SP cores divided into two groups of sixteen, one per warp scheduler?
2. Are the load/store and SFU units shared by all warps (i.e., uniform across the warps from both schedulers)?
3. If a warp is divided into two halves, which half is scheduled first? Is there a scheduler for this, or is one half just selected at random?
4. What is the advantage of this design? Is it just to maximize the utilization of the hardware?
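To make question 1 concrete, here is how I currently picture the dual schedulers interleaving two warps over the cycles (this is just my mental model in Python, not something the whitepaper spells out):

```python
# A toy timeline of how I imagine the dual schedulers interleave two
# warps, each split into half-warps of 16 threads (my assumption).

WARP_SIZE = 32
GROUP = 16  # cores in each scheduler's group

def issue(warp_id):
    """Split one warp instruction into half-warps of 16 threads each."""
    return [(warp_id, half) for half in range(WARP_SIZE // GROUP)]

# Scheduler 0 picks warp "A", scheduler 1 picks warp "B"; each needs
# two cycles to push its 32 threads through a 16-core group.
timeline = list(zip(issue("A"), issue("B")))

for cycle, (slot0, slot1) in enumerate(timeline):
    print(f"cycle {cycle}: scheduler0 -> {slot0}, scheduler1 -> {slot1}")
```

In this picture both warps finish after two cycles, and I have arbitrarily assumed the lower half (threads 0-15) goes first, which is exactly the part I am unsure about in question 3.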