In order to optimize a heavily used inner loop (a 3x3xN tensor convolution in the Winograd domain), I got good results by using the maximum number of NEON registers (32) and by trying to read as few coefficient/data values as possible relative to the number of arithmetic operations.
As expected, the larger kernel outperformed the first approach by some 15-25% on a MacBook M1, on iPhones (SE 2020, iPhone 8+) and on the Exynos 9820 (Exynos M4 / Cortex-A75 micro-architecture). However, to my great surprise, the larger kernel was up to 100% slower on the Exynos 9611 (Cortex-A73/Cortex-A53).
My first kernels split the convolution into four loops of this kind, each processing two outputs (recombining the accumulators in between):
0x3c0b50:
ldr q0, [x6] // loads 4 coefficients
ldp q25, q27, [x2] // loads 8 data
ldr q26, [x6, x16] // 4 more coefficients
add x6, x6, #16
subs w19, w19, #1
fmla v23.4s, v25.4s, v0.s[0]
fmla v19.4s, v25.4s, v26.s[0]
fmla v17.4s, v27.4s, v0.s[1]
fmla v18.4s, v27.4s, v26.s[1]
ldp q25, q27, [x2, #32] // 8 more data
add x2, x2, #64
fmla v22.4s, v25.4s, v0.s[2]
fmla v20.4s, v25.4s, v26.s[2]
fmla v24.4s, v27.4s, v0.s[3]
fmla v21.4s, v27.4s, v26.s[3]
b.ne 0x3c0b50
In this variant we have 8 accumulators, 2 registers for data and 2 registers for coefficients; each iteration spends 4 instructions on overhead, 8 on arithmetic and 4 on memory access. The loop count is typically on the order of 8..64.
The second variant has 24 accumulators, 24 arithmetic instructions, 7 load instructions and 2 overhead instructions per iteration, i.e. roughly 3.4 FMLAs per load versus 2 in the first variant.
0x3c4110:
ldp q0, q1, [x4], #32
ldp q4, q5, [x5], #32
ldp q6, q7, [x5], #32
fmla v8.4s, v4.4s, v0.s[0]
fmla v9.4s, v4.4s, v0.s[1]
fmla v10.4s, v4.4s, v0.s[2]
ldp q2, q3, [x4], #32
fmla v11.4s, v5.4s, v0.s[3]
fmla v12.4s, v5.4s, v1.s[0]
fmla v13.4s, v5.4s, v1.s[1]
ldp q4, q5, [x5], #32 // reload q4,q5 just after they are consumed
fmla v14.4s, v6.4s, v1.s[2]
fmla v15.4s, v6.4s, v1.s[3]
fmla v16.4s, v6.4s, v2.s[0]
ldp q0, q1, [x4], #32 // reload q0,q1 just after they are consumed
fmla v17.4s, v7.4s, v2.s[1]
fmla v18.4s, v7.4s, v2.s[2]
fmla v19.4s, v7.4s, v2.s[3]
ldp q6, q7, [x5], #32 // reload q6,q7 just after they are consumed
add x3, x3, #1
fmla v20.4s, v4.4s, v3.s[0]
fmla v21.4s, v4.4s, v3.s[1]
fmla v22.4s, v4.4s, v3.s[2]
fmla v23.4s, v5.4s, v3.s[3]
fmla v24.4s, v5.4s, v0.s[0]
fmla v25.4s, v5.4s, v0.s[1]
fmla v26.4s, v6.4s, v0.s[2]
fmla v27.4s, v6.4s, v0.s[3]
fmla v28.4s, v6.4s, v1.s[0]
fmla v29.4s, v7.4s, v1.s[1]
fmla v30.4s, v7.4s, v1.s[2]
fmla v31.4s, v7.4s, v1.s[3]
tbz w3, #11, 0x3c4110
In addition to these inner loops, the undisclosed code initializes the accumulators and performs the row- and column-wise Winograd output transformations (spilling to memory). I do not want to expose all that code, which I hope is irrelevant to the performance; instead I'm asking whether there is some easily spotted problem with the larger kernel that makes it perform so much less efficiently on the Cortex-A73 processors.
EDIT
What I can spot from the loops is that neither label is aligned to a cache line. The smaller loop is, by the way, exactly 16 instructions, i.e. 64 bytes (one cache line). The other loop is 33 instructions; there is a remote possibility of inferring the branch condition from the local temporary data pointer instead, as in tbz x5, #??, 0x3c4110. That would allow removing add x3, x3, #1 and bring the instruction count down to 32. Then it would also make sense to align the loop start to a cache-line boundary.
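For illustration, a minimal sketch of that idea in source form; the bit index K is a placeholder (the #?? above), and the trick only works if the buffer behind x5 is aligned and sized so that bit K of the pointer flips exactly after the last iteration:
.balign 64 // start the loop on a cache-line boundary (64 bytes)
loop:
ldp q0, q1, [x4], #32
ldp q4, q5, [x5], #32
// ... the remaining 5 loads and all 24 fmla instructions as above,
// with the add x3, x3, #1 removed ...
tbz x5, #K, loop // hypothetical: bit K of the data pointer becomes 1
                 // exactly when the buffer has been fully consumed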
Update
Some slight improvement was found by applying the suggestions in the comments, i.e. reading with ldp q0,q1,[x0], #128; ldp q2,q3,[x0, #-112] (execution time reduced from 194 ms to 190 ms on a very low-end device). So far this suggests the problem is not necessarily in the inner loops per se; the memory accesses differ only very slightly between the two approaches (the number of arithmetic operations is the same, the number of coefficients read is the same, but the larger kernel shares the data slightly more). It's possible that the cache hierarchy plays tricks on the A53 and A73 architectures alike.
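For reference, this is the addressing pattern sketched for the x5 stream of the larger loop (register names kept from above; the exact offsets depend on how many bytes one iteration consumes, so they may not match the real code): the pointer is updated once per 128-byte chunk and the remaining loads use negative offsets relative to the already-advanced pointer.
ldp q4, q5, [x5], #128 // single pointer update for the whole 128-byte chunk
ldp q6, q7, [x5, #-96] // = old base + 32
// ... fmla block ...
ldp q4, q5, [x5, #-64] // = old base + 64
// ... fmla block ...
ldp q6, q7, [x5, #-32] // = old base + 96
// ... fmla block ...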
Another undisclosed factor is that we are multithreading, of course, and the big.LITTLE architecture can paradoxically slow down when the kernel executes faster -- at least if the output is synchronised to the frame rate. In that case the OS can counterintuitively decide that a fast core is too idle after finishing all its tasks and migrate the work to a low-power core, where it then spends all the allocated time. This is an issue we thought had been resolved earlier -- see https://stackoverflow.com/a/64243494/1716339.