In order to optimize a heavily used inner loop (a 3x3xN tensor convolution in the Winograd domain), I got good results by using the maximum number of NEON registers (32) and by trying to read as few coefficient/data values as possible relative to the number of arithmetic operations.
As expected, the larger kernel outperformed the first approach by some 15-25% on a MacBook M1, on iPhones (SE 2020, iPhone 8+) and on the Exynos 9820 (Exynos M4 / Cortex-A75 micro-architecture). However, to my great surprise, the larger kernel was up to 100% slower on the Exynos 9611 (Cortex-A73/Cortex-A53).
My first kernels split the convolution into four loops of this kind, each processing two outputs (recombining the accumulators in between):
0x3c0b50:
ldr q0, [x6] // loads 4 coefficients
ldp q25, q27, [x2] // loads 8 data
ldr q26, [x6, x16] // 4 more coefficients
add x6, x6, #16
subs w19, w19, #1
fmla v23.4s, v25.4s, v0.s[0]
fmla v19.4s, v25.4s, v26.s[0]
fmla v17.4s, v27.4s, v0.s[1]
fmla v18.4s, v27.4s, v26.s[1]
ldp q25, q27, [x2, #32] // 8 more data
add x2, x2, #64
fmla v22.4s, v25.4s, v0.s[2]
fmla v20.4s, v25.4s, v26.s[2]
fmla v24.4s, v27.4s, v0.s[3]
fmla v21.4s, v27.4s, v26.s[3]
b.ne 0x3c0b50
In this variant we have 8 accumulators, 2 registers for data and 2 registers for coefficients; each iteration spends 4 instructions on overhead, 8 on arithmetic and 4 on memory access. The loop count is typically on the order of 8..64.
The second variant has 24 accumulators, 24 arithmetic instructions, 7 load instructions and 2 overhead instructions per iteration, i.e. roughly 3.4 FMLAs per load versus 2 in the first variant.
0x3c4110:
ldp q0, q1, [x4], #32
ldp q4, q5, [x5], #32
ldp q6, q7, [x5], #32
fmla v8.4s, v4.4s, v0.s[0]
fmla v9.4s, v4.4s, v0.s[1]
fmla v10.4s, v4.4s, v0.s[2]
ldp q2, q3, [x4], #32
fmla v11.4s, v5.4s, v0.s[3]
fmla v12.4s, v5.4s, v1.s[0]
fmla v13.4s, v5.4s, v1.s[1]
ldp q4, q5, [x5], #32 // reload q4,q5 just after they are consumed
fmla v14.4s, v6.4s, v1.s[2]
fmla v15.4s, v6.4s, v1.s[3]
fmla v16.4s, v6.4s, v2.s[0]
ldp q0, q1, [x4], #32 // reload q0,q1 just after they are consumed
fmla v17.4s, v7.4s, v2.s[1]
fmla v18.4s, v7.4s, v2.s[2]
fmla v19.4s, v7.4s, v2.s[3]
ldp q6, q7, [x5], #32 // reload q6,q7 just after they are consumed
add x3, x3, #1
fmla v20.4s, v4.4s, v3.s[0]
fmla v21.4s, v4.4s, v3.s[1]
fmla v22.4s, v4.4s, v3.s[2]
fmla v23.4s, v5.4s, v3.s[3]
fmla v24.4s, v5.4s, v0.s[0]
fmla v25.4s, v5.4s, v0.s[1]
fmla v26.4s, v6.4s, v0.s[2]
fmla v27.4s, v6.4s, v0.s[3]
fmla v28.4s, v6.4s, v1.s[0]
fmla v29.4s, v7.4s, v1.s[1]
fmla v30.4s, v7.4s, v1.s[2]
fmla v31.4s, v7.4s, v1.s[3]
tbz w3, #11, 0x3c4110
In addition to these inner loops, the undisclosed code initializes the accumulators and performs the row- and column-wise Winograd output transformations (spilling to memory). I do not want to expose all that code, which I hope is irrelevant to the performance; instead I'm asking whether there is some easily spotted problem with the larger kernel that makes it perform so much less efficiently on the Cortex-A73 processors.
EDIT
What I can spot from the loops is that neither label is aligned to a cache line. The smaller loop is, by the way, exactly 16 instructions, i.e. 64 bytes (one cache line). The other loop is 33 instructions; there is a remote possibility of inferring the branch condition from the local temporary data pointer instead, as in tbz x5, #??, 0x3c4110. That would allow removing add x3, x3, #1 and bring the instruction count down to 32. Then it would also make sense to align the loop start to a cache-line boundary.
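For illustration, a minimal sketch of that idea in source form; the bit index K is a placeholder (the #?? above), and the trick only works if the buffer behind x5 is aligned and sized so that bit K of the pointer flips exactly after the last iteration:
.balign 64 // start the loop on a cache-line boundary (64 bytes)
loop:
ldp q0, q1, [x4], #32
ldp q4, q5, [x5], #32
// ... the remaining 5 loads and all 24 fmla instructions as above,
// with the add x3, x3, #1 removed ...
tbz x5, #K, loop // hypothetical: bit K of the data pointer becomes 1
                 // exactly when the buffer has been fully consumed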
Update
Some slight improvement was found by applying the suggestions in the comments, i.e. reading with ldp q0,q1,[x0], #128; ldp q2,q3,[x0, #-112] (execution time reduced from 194 ms to 190 ms on a very low-end device). So far this suggests the problem is not necessarily in the inner loops per se; the memory accesses differ only very slightly between the two approaches (the number of arithmetic operations is the same, the number of coefficients read is the same, but the larger kernel shares the data slightly more). It's possible that the cache hierarchy plays tricks on the A53 and A73 architectures alike.
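For reference, this is the addressing pattern sketched for the x5 stream of the larger loop (register names kept from above; the exact offsets depend on how many bytes one iteration consumes, so they may not match the real code): the pointer is updated once per 128-byte chunk and the remaining loads use negative offsets relative to the already-advanced pointer.
ldp q4, q5, [x5], #128 // single pointer update for the whole 128-byte chunk
ldp q6, q7, [x5, #-96] // = old base + 32
// ... fmla block ...
ldp q4, q5, [x5, #-64] // = old base + 64
// ... fmla block ...
ldp q6, q7, [x5, #-32] // = old base + 96
// ... fmla block ...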
Another undisclosed factor is that we are multithreading, of course, and the big.LITTLE architecture can paradoxically slow down when the kernel executes faster -- at least if the output is synchronised to the frame rate. In that case the OS can counterintuitively decide that a fast core is too idle after finishing all its tasks and migrate the work to a low-power core, where it then spends all the allocated time. This is an issue we thought had been resolved earlier -- see https://stackoverflow.com/a/64243494/1716339.