
Compute Workload Analysis displays the utilization of different compute pipelines. I know that in a modern GPU, the integer and floating point pipelines are separate hardware units that can execute in parallel. However, for the other pipelines it is not clear which hardware unit each one represents. I also couldn't find any documentation online explaining the abbreviations or how to interpret the pipelines.

My questions are:

1) What are the full names of ADU, CBU, TEX, XU? How do they map to the hardware?

2) Which of the pipelines use the same hardware unit (e.g. do FP16, FMA, and FP64 all use the floating point unit)?

3) A warp scheduler in a modern GPU can schedule 2 instructions per cycle (using different pipelines). Which pipelines can be used at the same time (e.g. FMA-ALU, FMA-SFU, ALU-Tensor, etc.)?

P.S.: I am adding a screenshot for those who are not familiar with Nsight Compute. [screenshot: the Compute Workload Analysis section showing per-pipeline utilization]

heapoverflow
  • All of this can be answered by reading the official whitepaper of the architecture you are using – talonmies Apr 24 '20 at 17:42
  • I am interested in Volta and Turing. Even though the whitepapers provide a great deal of information, the Turing whitepaper only partially answers question 3. There is no information regarding the first 2 questions in either whitepaper, nor in the Nsight Compute documentation. – heapoverflow Apr 24 '20 at 18:05

1 Answer


The Volta (CC 7.0) and Turing (CC 7.5) SM is composed of 4 sub-partitions (SMSPs). Each sub-partition contains:

  • warp scheduler
  • register file
  • immediate constant cache
  • execution units
    • ALU, FMA, FP16, UDP (7.5+), and XU
    • FP64 on compute centric parts (GV100)
    • Tensor units
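
Not from the answer itself, but as a quick way to check which of these two architectures (and therefore which unit mix) a device has, here is a minimal device-query sketch using the standard runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // CC 7.0 = Volta (e.g. GV100), CC 7.5 = Turing
    std::printf("%s: CC %d.%d, %d SMs\n",
                prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    // Each SM holds 4 sub-partitions (SMSPs); the per-SMSP unit mix is not
    // exposed by the runtime API, so use the breakdown above.
    return 0;
}
```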

The SM contains several other partitions that contain execution units and resources shared by the 4 sub-partitions, including:

  • instruction cache
  • index constant cache
  • L1 data cache that is partitioned into tagged RAM and shared memory (see the carveout sketch after this list)
  • execution units
    • ADU, LSU, TEX
    • On non-compute-centric parts, FP64 and Tensor may be implemented as a shared execution unit
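
Because the L1 data cache and shared memory share one physical partition on these chips, the runtime exposes a hint for the split. A minimal sketch using the standard carveout attribute (the kernel itself is just an illustration):

```cuda
#include <cuda_runtime.h>

__global__ void tile_kernel(float* out) {
    __shared__ float tile[32][32];   // lives in the shared-memory side of the unified L1
    tile[threadIdx.y][threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    float* d_out;
    cudaMalloc(&d_out, 32 * 32 * sizeof(float));
    // Hint that this kernel prefers the maximum shared-memory carveout of the
    // unified L1; the hardware may still choose a different split.
    cudaFuncSetAttribute(tile_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);
    tile_kernel<<<1, dim3(32, 32)>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```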

In Volta (CC 7.0, 7.2) and Turing (CC 7.5) each SM sub-partition can issue 1 instruction per cycle. The instruction can be issued either to a local execution unit or to the execution units shared across the SM.
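
As an illustration of feeding two different pipes from one scheduler, here is a toy kernel (my own sketch, not from the answer) whose loop carries two independent chains, one FP32 multiply-add and one integer/logic chain; which pipe each SASS instruction actually lands in can be checked on Nsight Compute's Source page:

```cuda
// Independent FP32 and integer chains: the FFMA is expected to go to the FMA
// pipe and the LOP3/IADD3 to the ALU pipe, so the scheduler can alternate
// issue between them even though each pipe accepts work every other cycle.
__global__ void mixed_pipes(float* fout, int* iout, int n) {
    float f = (float)threadIdx.x;
    int   i = threadIdx.x;
    for (int k = 0; k < n; ++k) {
        f = fmaf(f, 1.000001f, 0.5f);  // FFMA         -> FMA pipe
        i = (i ^ k) + 3;               // LOP3 + IADD3 -> ALU pipe
    }
    fout[threadIdx.x] = f;
    iout[threadIdx.x] = i;
}
```

Pipe utilization for a run can be confirmed with Nsight Compute metrics such as sm__inst_executed_pipe_fma and sm__inst_executed_pipe_alu.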

  • ADU - Address Divergence Unit. The ADU is responsible for per-thread address divergence handling for branches/jumps and indexed constant loads before instructions are forwarded to the other execution units.
  • ALU - Arithmetic Logic Unit. The ALU is responsible for execution of most integer instructions, bit manipulation instructions, and logic instructions.
  • CBU - Convergence Barrier Unit. The CBU is responsible for barrier, convergence, and branch instructions.
  • FMA - Floating point Multiply and Accumulate Unit. The FMA is responsible for most FP32 instructions, integer multiply and accumulate instructions, and integer dot product.
  • FP16 - Paired half-precision floating point unit. The FP16 unit is responsible for execution of paired half-precision floating point instructions.
  • FP64 - Double precision floating point unit. The FP64 unit is responsible for all FP64 instructions. FP64 is often implemented as several different pipes on NVIDIA GPUs. The throughput varies greatly per chip.
  • LSU - Load Store Unit. The LSU is responsible for load, store and atomic instructions to global, local, and shared memory.
  • Tensor (FP16) - Half-precision floating point matrix multiply and accumulate unit.
  • Tensor (INT) - Integer matrix multiply and accumulate unit.
  • TEX - Texture Unit. The texture unit is responsible for sampling, load, and filtering instructions on textures and surfaces.
  • UDP (Uniform) - Uniform Data Path - A scalar unit used to execute instructions where the input and output are identical for all threads in a warp.
  • XU - Transcendental and Data Type Conversion Unit - The XU is responsible for special functions such as sin, cos, and reciprocal square root, as well as data type conversions (see the sketch after this list).
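
To make the glossary concrete, here is a toy kernel (my own illustration; the pipe assignments in the comments are expectations to verify on Nsight Compute's Source page) that touches several of these pipes:

```cuda
#include <cuda_fp16.h>

__global__ void pipe_tour(const float* in, float* out, double* dout) {
    int t = threadIdx.x + blockIdx.x * blockDim.x;  // IMAD            -> FMA pipe
    float x = in[t];                                // LDG             -> LSU
    float s = __sinf(x);                            // MUFU.SIN        -> XU
    float r = rsqrtf(x * x + 1.0f);                 // MUFU.RSQ        -> XU
    __half2 h = __floats2half2_rn(s, r);            // F2F conversions -> XU
    h = __hmul2(h, h);                              // HMUL2           -> FP16 pipe
    dout[t] = (double)s * 3.0;                      // DMUL            -> FP64 pipe
    out[t] = __low2float(h) + __high2float(h);      // FADD            -> FMA pipe
}
```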
Greg Smith
  • Thank you very much for the very informative answer. Two follow-up questions: 1) In Turing it is possible to execute FMA and ALU concurrently; does this mean that they are scheduled one cycle apart, with the rest of both pipelines executing concurrently? 2) Is FP16 using the same unit as FMA? Both whitepapers state that there are 64 FP32 and 64 INT32 units in each SM. – heapoverflow Apr 25 '20 at 20:25
  • 1
    1) In Volta-Turing the FMA, ALU, and FP16 pipes can execute an instruction (per SM sub-partition) every other cycle. The warp scheduler can alternate issuing instructions every cycle (ALU, FMA, ALU, FMA). – Greg Smith Apr 30 '20 at 21:22
  • 1
    2) In Volta-Turing the FP16x2 pipe is independent of the FMA pipe. The whitepaper does not show all of the pipes listed above. The implementation can vary between architectures and chips in the same architecture (e.g. tensor pipes or fp64 pipe in HPC/DL focused chips vs. graphics focused chips). – Greg Smith Apr 30 '20 at 21:25