From the point of view of the instruction pipeline, why do the instructions in the following code not block (stall) each other?
.text
.global stress
// void stress(int)
stress:
.loop1:
fmla v2.4s, v0.4s, v1.4s
fmla v3.4s, v0.4s, v1.4s
subs x0, x0, #1
fmla v4.4s, v0.4s, v2.4s
fmla v5.4s, v0.4s, v3.4s
fmla v6.4s, v0.4s, v4.4s
fmla v7.4s, v0.4s, v5.4s
fmla v8.4s, v0.4s, v6.4s
fmla v9.4s, v0.4s, v7.4s
bne .loop1
ret
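With a minor modification, where the first two fmla instructions read v8 and v9 (written at the end of the previous iteration) instead of v1, the instructions block each other, as expected.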
.text
.global stress
// void stress(int)
stress:
.loop1:
fmla v2.4s, v0.4s, v8.4s // was v1.4s in the first version
fmla v3.4s, v0.4s, v9.4s // was v1.4s in the first version
subs x0, x0, #1
fmla v4.4s, v0.4s, v2.4s
fmla v5.4s, v0.4s, v3.4s
fmla v6.4s, v0.4s, v4.4s
fmla v7.4s, v0.4s, v5.4s
fmla v8.4s, v0.4s, v6.4s
fmla v9.4s, v0.4s, v7.4s
bne .loop1
ret
t1, the running time of the first program, is a quarter of t2, the running time of the second. Since the latency of fmla on the Cortex-A76 is 4 cycles, this suggests that the first program runs without any blocking.
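
To make the 1:4 arithmetic concrete, here is a toy dataflow model of the two loops. It is only a sketch under loud assumptions: every fmla operand (including the accumulator) needs the full 4-cycle latency, the core can issue any number of ready instructions per cycle, and subs/bne cost nothing. A real Cortex-A76 has a finite number of FP pipes and may treat the accumulator operand differently, so this says nothing definitive about the actual hardware. The names `PROGRAM_1`, `PROGRAM_2` and `cycles_per_iteration` are made up for this illustration.

```python
# Toy dataflow model: an fmla may start once all of its source registers are
# ready and finishes LATENCY cycles later. Assumes unlimited issue width and
# ideal out-of-order execution, which a real core only approximates.
LATENCY = 4  # assumed fmla result latency in cycles, as quoted in the question

# (destination, sources) for each fmla in the loop body. The destination is
# listed as a source too, because fmla accumulates into it. subs/bne are ignored.
PROGRAM_1 = [
    ("v2", ("v2", "v0", "v1")),
    ("v3", ("v3", "v0", "v1")),
    ("v4", ("v4", "v0", "v2")),
    ("v5", ("v5", "v0", "v3")),
    ("v6", ("v6", "v0", "v4")),
    ("v7", ("v7", "v0", "v5")),
    ("v8", ("v8", "v0", "v6")),
    ("v9", ("v9", "v0", "v7")),
]

# The second program differs only in the first two instructions.
PROGRAM_2 = [
    ("v2", ("v2", "v0", "v8")),
    ("v3", ("v3", "v0", "v9")),
] + PROGRAM_1[2:]


def cycles_per_iteration(program, iterations=1000):
    """Return the steady-state cycles per loop iteration under the toy model."""
    ready = {}   # register -> cycle at which its latest value becomes available
    finish = []  # completion cycle of each simulated iteration
    for _ in range(iterations):
        for dest, sources in program:
            # Registers never written in the loop (v0, v1) are ready at cycle 0.
            start = max(ready.get(src, 0) for src in sources)
            ready[dest] = start + LATENCY
        finish.append(max(ready.values()))
    # The gap between the last two iterations is the steady-state rate.
    return finish[-1] - finish[-2]


if __name__ == "__main__":
    print("program 1:", cycles_per_iteration(PROGRAM_1), "cycles/iteration")
    print("program 2:", cycles_per_iteration(PROGRAM_2), "cycles/iteration")
```

Under these assumptions the model gives 4 cycles per iteration for the first loop and 16 for the second, matching the observed t1 = t2 / 4. In the first loop the only loop-carried dependency is each register on itself, so successive iterations overlap; in the second, v2 -> v4 -> v6 -> v8 -> v2 forms a chain of four dependent fmla instructions that wraps around the loop.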