From the point of view of the instruction pipeline, why do the instructions in the following code not block each other?

.text
.global stress
// void stress(int)

stress:
.loop1:
    fmla v2.4s, v0.4s, v1.4s
    fmla v3.4s, v0.4s, v1.4s

    subs x0, x0, #1

    fmla v4.4s, v0.4s, v2.4s
    fmla v5.4s, v0.4s, v3.4s

    fmla v6.4s, v0.4s, v4.4s
    fmla v7.4s, v0.4s, v5.4s

    fmla v8.4s, v0.4s, v6.4s
    fmla v9.4s, v0.4s, v7.4s

    bne .loop1
    ret

With a minor modification (the first two fmla instructions read v8 and v9 instead of v1), they do block each other, as expected.

.text
.global stress
// void stress(int)

stress:
.loop1:
    fmla v2.4s, v0.4s, v8.4s
    fmla v3.4s, v0.4s, v9.4s

    subs x0, x0, #1

    fmla v4.4s, v0.4s, v2.4s
    fmla v5.4s, v0.4s, v3.4s

    fmla v6.4s, v0.4s, v4.4s
    fmla v7.4s, v0.4s, v5.4s

    fmla v8.4s, v0.4s, v6.4s
    fmla v9.4s, v0.4s, v7.4s

    bne .loop1
    ret

t1 is a quarter of t2, and the latency of fmla on the Cortex-A76 is 4 cycles, which means that the instructions in the first program do not block each other.

  • Which data dependencies are you talking about? `fmla` only writes the destination (first operand), and I don't see an output being re-read as an input to a later instruction until at least one instruction later. (At that point it will stall, if FMLA latency * throughput is greater than 2, i.e. if the pipeline can keep more than 2 of it in flight.) – Peter Cordes Oct 20 '22 at 08:19
  • @PeterCordes `fmla v2.4s, v0.4s, v1.4s` and `fmla v4.4s, v0.4s, v2.4s`: aren't these two interdependent? They are separated by an instruction. – zhiyujiang Oct 20 '22 at 08:26
  • Yeah, the first writes `v2`, the 2nd reads `v2`. But they're not adjacent like your question title is asking about, they're separated by at least one instruction. (Which probably isn't enough to avoid stalling, like I said in my last comment.) Actually those instructions also have a `subs` separating them. – Peter Cordes Oct 20 '22 at 08:33
  • https://en.wikipedia.org/wiki/ARM_Cortex-A76 is an out-of-order design, and there's no loop-carried dependency chain because nothing writes `v1`. So there are dependencies, but only within a single iteration which OoO exec can hide. No idea what `t1` and `t2` you're talking about. Maybe execution times? – Peter Cordes Oct 20 '22 at 08:37
  • Near duplicate of [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) and [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) (which actually starts off talking about how register renaming avoids hazards.) – Peter Cordes Oct 20 '22 at 08:40
  • @PeterCordes I thought that the first and second instructions were dual-issued and that the third instruction was logically data dependent on the first, so the third instruction would finish 4 cycles later than the first and second, but in fact it did not. Do I understand this correctly? – zhiyujiang Oct 20 '22 at 08:42
  • If you just timed the total time for execution of many iterations of the loop, how can you be sure about the timing details of individual instructions within one iteration? Did you simulate it with LLVM-MCA or an ARM64 equivalent of https://uica.uops.info/ ? Because that would tell you that the loop iterations are overlapping thanks to out-of-order exec, but the CPU is still respecting data deps between instructions that have them. – Peter Cordes Oct 20 '22 at 08:45
  • @PeterCordes yes, t1 refers to the first program execution time, t2 refers to the second program execution time, thank you for the detailed information, I will refer to it. – zhiyujiang Oct 20 '22 at 08:46
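
The loop-carried dependency argument from the comments can be sketched with a toy model. This is only a rough illustration, not a simulation of the Cortex-A76: it assumes unlimited out-of-order resources, the 4-cycle `fmla` latency quoted in the question, two FP/ASIMD pipes (an assumption), and it ignores `subs` and `bne`. The helper names `fmla`, `chain`, and `cycles_per_iteration` are made up for this sketch. It also models the fact that `fmla` accumulates into its destination, so the destination register is an input as well as an output.

```python
# Toy dependency model: unlimited execution units, so only data
# dependencies (plus a simple issue-width bound) limit the loop.
FMLA_LATENCY = 4   # cycles, the figure quoted in the question
FP_PIPES = 2       # assumption: two FP/ASIMD pipes

def fmla(dst, a, b):
    # fmla accumulates: dst = dst + a * b, so dst is also a source.
    return (dst, (dst, a, b))

def chain(first_pair_operands):
    """Build the 8-fmla loop body; the argument feeds the first two fmla."""
    first, second = first_pair_operands
    return [
        fmla("v2", "v0", first), fmla("v3", "v0", second),
        fmla("v4", "v0", "v2"), fmla("v5", "v0", "v3"),
        fmla("v6", "v0", "v4"), fmla("v7", "v0", "v5"),
        fmla("v8", "v0", "v6"), fmla("v9", "v0", "v7"),
    ]

def cycles_per_iteration(body, iterations=1000):
    ready = {}   # register -> cycle at which its latest value is available
    finish = []  # completion time of each iteration
    for _ in range(iterations):
        for dst, srcs in body:
            start = max(ready.get(r, 0) for r in srcs)
            ready[dst] = start + FMLA_LATENCY
        finish.append(max(ready.values()))
    # Steady-state slope over the second half of the run.
    half = iterations // 2
    latency_bound = (finish[-1] - finish[half - 1]) / (iterations - half)
    throughput_bound = len(body) / FP_PIPES
    return max(latency_bound, throughput_bound)

program1 = chain(("v1", "v1"))   # first two fmla read v1
program2 = chain(("v8", "v9"))   # first two fmla read v8 / v9

print(cycles_per_iteration(program1))
print(cycles_per_iteration(program2))
```

Under these assumptions the model gives 4 cycles per iteration for the first loop (each accumulator only depends on itself across iterations) and 16 for the second (v2, v4, v6, v8 form one chain of four `fmla` that wraps around the loop), which is consistent with the reported 1:4 ratio between t1 and t2. Real hardware has finite resources and forwarding paths that this sketch does not capture.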
