Which GPU execution dependencies have fixed latency (causing 'Wait' stalls)?

Question

With recent NVIDIA micro-architectures, there's a new (?) taxonomy of warp stall reasons / warp scheduler states. One of these is:

Wait : Warp was stalled waiting on a fixed latency execution dependency.

As @GregSmith explains, fixed-latency instructions are: "Math, bitwise [and] register movement". But what are fixed-latency "execution dependencies"? Are these just "waiting for somebody else's fixed-latency instruction to conclude before we can issue our own"?

@RobertCrovella: As far as the classification of _instructions_ goes, yes. But are "FOO execution dependencies" just "waiting on execution of instructions of type FOO"? If so, I'll close this as a dupe. — einpoklum, Mar 14 '21 at 14:26
Execution dependencies are the inputs to the next instruction, including register operands and predicates. Suppose you had a chain of IADD operations: IADD r0, r1, r2; IADD r4, r0, r3. The second instruction is r4 = r0 + r3, and r0 is the output of the first IADD, so the warp is stalled until the first instruction has completed. Since IADD is a fixed-latency instruction, the compiler can state the minimum number of cycles the scheduler has to wait between issuing the first IADD and the second IADD. During these cycles the warp is stalled with the "wait" reason. — Greg Smith, Mar 15 '21 at 00:51

score 1 · Answer 1 · answered Mar 19 '21 at 15:08

Execution dependencies are dependencies that must be resolved before the next instruction can be issued. These include register operands and predicates. The Wait stall reason is reported between instructions that have fixed latency. The compiler can also choose to add wait cycles between instructions issued to the same pipeline if that pipeline's issue rate is not 1 warp per cycle (e.g. the FMA and ALU pipes can issue every other cycle on GV100 through GA100).

EXAMPLE 1 - No dependencies - compiler added waits

IADD  R0, R1, R2;  # R0 = R1 + R2
// stall = wait for 1 additional cycle
IADD  R4, R5, R6;  # R4 = R5 + R6
// stall = wait for 1 additional cycle
IADD  R8, R9, R10; # R8 = R9 + R10

If the compiler did not add wait cycles, the stall reason would be math_throttle instead. The math_throttle reason can also show up when the warp is ready to issue the instruction (all dependencies resolved) but another warp is issuing an instruction to the target pipeline.
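The spacing in Example 1 can be illustrated with a small toy model. This is only a sketch, not how the hardware or the compiler is implemented: the function names `issue_cycles` and `wait_cycles` are hypothetical, and the only input taken from the answer is the "every other cycle" issue rate (an interval of 2).

```python
def issue_cycles(num_instructions, issue_interval=2):
    """Cycle at which each instruction issues when the target pipe accepts
    one instruction from this warp every `issue_interval` cycles.
    issue_interval=2 mirrors the "every other cycle" case in the answer."""
    return [i * issue_interval for i in range(num_instructions)]


def wait_cycles(num_instructions, issue_interval=2):
    """Idle cycles between consecutive issues (the 'wait' gaps in Example 1)."""
    cycles = issue_cycles(num_instructions, issue_interval)
    return [later - earlier - 1 for earlier, later in zip(cycles, cycles[1:])]


# Three independent IADDs, as in Example 1.
print(issue_cycles(3))  # [0, 2, 4]
print(wait_cycles(3))   # [1, 1]
```

With an interval of 2, each pair of back-to-back instructions is separated by one idle cycle, which matches the "wait for 1 additional cycle" annotations in Example 1.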

EXAMPLE 2 - Wait stalls due to read after write dependency

IADD  R0, R1, R2;  # R0 = R1 + R2
// stall - wait for fixed number of cycles to clear read after write
IADD  R0, R0, R3;  # R0 += R3
// stall - wait for fixed number of cycles to clear read after write
IADD  R0, R0, R4;  # R0 += R4
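A toy model may help show why the dependent chain in Example 2 accumulates fixed waits while independent instructions do not. It is a rough sketch under stated assumptions: `Instr`, `schedule`, and the 4-cycle latency are illustrative placeholders (not a documented hardware figure), and the model ignores pipeline issue-rate limits and other warps.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dst: str               # destination register
    srcs: Tuple[str, ...]  # source registers


def schedule(instrs, latency=4):
    """Toy in-order issue for a single warp.

    Each instruction issues at the earliest cycle at which every source
    register written by an earlier instruction is available. `latency`
    is a hypothetical fixed latency, chosen only for illustration.
    """
    ready_at = {}  # register -> first cycle its new value can be read
    next_free = 0  # earliest cycle the warp could issue again
    issued = []
    for ins in instrs:
        cycle = max([next_free] + [ready_at.get(r, 0) for r in ins.srcs])
        issued.append(cycle)
        ready_at[ins.dst] = cycle + latency
        next_free = cycle + 1
    return issued


# Read-after-write chain on R0, as in Example 2.
chain = [
    Instr("R0", ("R1", "R2")),
    Instr("R0", ("R0", "R3")),
    Instr("R0", ("R0", "R4")),
]
# Independent instructions, for comparison.
independent = [
    Instr("R0", ("R1", "R2")),
    Instr("R4", ("R5", "R6")),
    Instr("R8", ("R9", "R10")),
]

print(schedule(chain))        # [0, 4, 8]
print(schedule(independent))  # [0, 1, 2]
```

Because the latency is fixed, the gap between dependent instructions is known ahead of time, which is what lets the compiler encode the wait rather than leave it to be discovered at run time.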

1. "the compiler can choose" - why is the compiler involved here? And, in your example, in what way is the compiler adding waits if there are no "wait" instructions, just comments? 2. "between instructions that have fixed latency" - so, not between one instruction with fixed latency and another with variable latency, say, before it? — einpoklum, Mar 19 '21 at 17:32
