5

The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).

However, with our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per cycle, even when plenty of instruction-level parallelism is available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care of this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock (NEON integer arithmetic only, no loads/stores).

Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?

Thx, Clemens

user2923748
  • There are a fair few instructions that can only issue down one pipe, and also quite a number where the Q form has a cycle more latency/less throughput than the D form - what's the _actual_ code (disassembly) in question? – Notlikethat Dec 02 '15 at 09:01
  • The code in question looks like this, for example:
        .loop:
        vand q2, q3, q2
        vand q3, q4, q3
        vand q4, q3, q4
        vand q5, q6, q5
        vand q6, q7, q6
        vand q7, q8, q7
        vand q8, q9, q8
        vand q9, q10, q9
        vand q10, q11, q10
        vand q11, q12, q11
        vand q12, q13, q12
        vand q13, q14, q13
        vand q14, q15, q14
        vand q15, q1, q15
        subs r0, r0, #1
        bne .loop
    – user2923748 Dec 02 '15 at 11:01
  • 2
    Don't paste code in a comment. You can edit your question for that. – ElderBug Dec 02 '15 at 11:34
  • 1
    You have `vand q4, q3, q4` in your code instead of `vand q4, q5, q4`. This will add a dependency on the previous instruction. – ElderBug Dec 02 '15 at 11:36
  • Thanks for the hint, it was a typo. I benchmarked again, and with the code at http://pastebin.com/AQCN5uuM I get an IPC of roughly 1.5. I really wonder what is going wrong here... – user2923748 Dec 02 '15 at 15:51
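
The dependency pointed out above can be checked mechanically. Here is a rough sketch in C, assuming each instruction is reduced to a (dst, src1, src2) triple of Q-register numbers. The `struct insn` type and the `first_adjacent_raw` function are hypothetical helpers written for illustration, and the check only looks at the immediately preceding instruction, which is a simplification of what an out-of-order core actually tracks:

```c
#include <assert.h>
#include <stddef.h>

/* One three-operand instruction, dst = op(src1, src2), registers by number.
 * Hypothetical type, introduced only for this sketch. */
struct insn { int dst, src1, src2; };

/* Return the index of the first instruction that reads the register written
 * by the instruction immediately before it, or -1 if there is none. */
static int first_adjacent_raw(const struct insn *code, size_t n)
{
    assert(code != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        int prev_dst = code[i - 1].dst;
        if (code[i].src1 == prev_dst || code[i].src2 == prev_dst)
            return (int)i;
    }
    return -1;
}
```

For the first four instructions of the loop as originally posted, `{2,3,2}, {3,4,3}, {4,3,4}, {5,6,5}`, this returns 2 (the `vand q4, q3, q4` line); with the third triple corrected to `{4,5,4}` it returns -1.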

2 Answers

2

According to ARM support, the reason seems to be that the NEON register file only has 3x 64-bit write ports.

So although the NEON ALUs are capable of processing 2x 128-bit vectors, the register file is not capable of consuming the results ... what a (very) strange design decision.
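
To make the arithmetic behind this explicit, here is a minimal C sketch of the upper bound such a limit would imply. It assumes (as a simplification, not something ARM confirmed in this detail) that a 128-bit result occupies two 64-bit write ports, a 64-bit result occupies one, and at most two NEON instructions issue per cycle. The function name `ipc_bound` and its parameters are made up for illustration:

```c
#include <assert.h>

/* Upper bound on NEON instructions per cycle under an ASSUMED model:
 * a 128-bit (Q) result needs two 64-bit write ports, a 64-bit (D) result
 * needs one. frac_q is the fraction of Q-form instructions (0.0 to 1.0). */
static double ipc_bound(double frac_q, int write_ports, int issue_width)
{
    assert(frac_q >= 0.0 && frac_q <= 1.0);
    assert(write_ports > 0 && issue_width > 0);

    /* Average number of 64-bit register writes per instruction. */
    double writes_per_insn = 2.0 * frac_q + 1.0 * (1.0 - frac_q);

    /* Instructions per cycle that the write ports can absorb. */
    double port_limit = (double)write_ports / writes_per_insn;

    /* The issue width caps the result as well. */
    return port_limit < (double)issue_width ? port_limit : (double)issue_width;
}
```

Under these assumptions, `ipc_bound(1.0, 3, 2)` gives 1.5 for a pure 128-bit stream, in line with the IPC of about 1.5 measured with the corrected benchmark. `ipc_bound(0.5, 3, 2)` gives 2.0 for a 50/50 mix, so this is only an upper bound and does not by itself explain the 1.25 reported in the question.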

user2923748
  • I have verified this limit in practice through extensive benchmarking, and observed the same behavior in the Cortex-A72 core (BCM2711 CPU used in the Raspberry Pi 4). Additionally, although you speak of the "NEON register file", this appears to be a global limit, not a NEON-specific one: a loop made of multiple copies of a fully independent group of 2 NEON + 1 scalar instructions gives even worse performance than 2 NEON instructions alone. – swineone Dec 14 '21 at 15:34
0

In real code, not all instruction results will be written to the register file; some will instead pass through forwarding paths. If you mix dependent and independent instructions in your code, you may see higher IPC.

The A57 optimisation guide states that late-forwarding occurs for chains of multiply-accumulate instructions, so maybe something like this will dual-issue.

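.loop:
    @ first accumulate chain: each vmla feeds q0 into the next
    vmla.s16 q0,q0,q1
    vmla.s16 q0,q0,q2
    vmla.s16 q0,q0,q3
    @ second, independent accumulate chain on q4
    vmla.s16 q4,q4,q1
    vmla.s16 q4,q4,q2
    vmla.s16 q4,q4,q3
    ...etc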
Charles Baylis