I already asked on the Arm developer forums, but got no answers: https://community.arm.com/developer/tools-software/hpc/f/hpc-user-group/45524/any-neon-instructions-can-be-dual-issued-with-vector-long-multiply-accumulate-smlal-on-cortex-a53-or-a55
Does anyone know if this is possible? I've tried several instruction mixes:
                                 A53    A55   (instructions/cycle)
smlal.int8 & 64-bit load (ld1)   0.97   1.72
smlal.int8 & 64-bit dup          0.97   0.96
smlal.int8 & 64-bit orr          0.97   0.96
mla.int8   & 64-bit dup          1.79   1.74
mla.int16  & 64-bit dup          0.97   0.95
The Cortex-A55 software optimization guide has contradictory info: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf
In one place it lists smlal with dual issue = 01, which in my understanding means it can only issue in slot 0, but allows another SIMD instruction to issue in slot 1.
In another place it lists dual issue = 00, which means dual issue is prevented entirely. Does that mean it occupies both slots?
What's even more surprising is that even a simple SIMD orr can't execute in parallel with it. What I really want to do is dual-issue a dup (of a specific lane) with smlal.
What's going on? Is smlal using both SIMD units?
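For concreteness, the pairing I'd like to achieve looks like this (a sketch; the register choices and lane indices are arbitrary):

```asm
// hoped-for schedule: each dup broadcasts the next lane while an
// independent smlal issues in the same cycle alongside it
dup    v5.8b,  v1.b[0]
smlal  v16.8h, v0.8b, v5.8b
dup    v6.8b,  v1.b[1]
smlal  v17.8h, v0.8b, v6.8b
```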
One suggestion was that it could be a register-file bandwidth bottleneck (see: Can Cortex-A57 dual-issue 128-bit NEON instructions?).
That makes sense, because unlike almost every other instruction, smlal reads 3 operands and writes an output that's 2x as wide as its sources, while a regular mla does 3 reads and 1 same-width write. And it would be very unreasonable to add a 3rd write port to the NEON register file just to support writing smlal's 128-bit output alongside another 64-bit output from a 2nd instruction.
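Spelling out the operand traffic per instruction (my reading of the ISA; worth double-checking against the Arm ARM):

```asm
mla    v2.8b, v0.8b, v1.8b   // 3 reads of 64 bits (v0, v1, and the
                             // accumulator v2), 1 write of 64 bits
smlal  v2.8h, v0.8b, v1.8b   // reads two 64-bit sources plus the
                             // 128-bit accumulator v2.8h, then
                             // writes the full 128 bits back
```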
Another explanation is that some instructions can't be dual-issued every cycle (e.g. "vector load cannot be issued on the 4th cycle of each fmla"):
https://github.com/Tencent/ncnn/wiki/arm-a53-a55-dual-issue
// Throughput test: 16 independent smlal/ld1 pairs per iteration.
// Registers rotate so each ld1 overwrites a register only after the
// smlal that reads it has already issued.
uint8_t data[32] __attribute__((aligned(32)));
for (int i = 0; i < N; i += 16)
{
asm volatile(
"smlal v9.8h, v0.8b, v0.8b\n"
"ld1 {v0.8b},%0\n"
"smlal v10.8h, v1.8b, v1.8b\n"
"ld1 {v1.8b},%0\n"
"smlal v11.8h, v2.8b, v2.8b\n"
"ld1 {v2.8b},%0\n"
"smlal v12.8h, v3.8b, v3.8b\n"
"ld1 {v3.8b},%0\n"
"smlal v13.8h, v4.8b, v4.8b\n"
"ld1 {v4.8b},%0\n"
"smlal v14.8h, v5.8b, v5.8b\n"
"ld1 {v5.8b},%0\n"
"smlal v15.8h, v6.8b, v6.8b\n"
"ld1 {v6.8b},%0\n"
"smlal v0.8h, v7.8b, v7.8b\n"
"ld1 {v7.8b},%0\n"
"smlal v1.8h, v8.8b, v8.8b\n"
"ld1 {v8.8b},%0\n"
"smlal v2.8h, v9.8b, v9.8b\n"
"ld1 {v9.8b},%0\n"
"smlal v3.8h, v10.8b, v10.8b\n"
"ld1 {v10.8b},%0\n"
"smlal v4.8h, v11.8b, v11.8b\n"
"ld1 {v11.8b},%0\n"
"smlal v5.8h, v12.8b, v12.8b\n"
"ld1 {v12.8b},%0\n"
"smlal v6.8h, v13.8b, v13.8b\n"
"ld1 {v13.8b},%0\n"
"smlal v7.8h, v14.8b, v14.8b\n"
"ld1 {v14.8b},%0\n"
"smlal v8.8h, v15.8b, v15.8b\n"
"ld1 {v15.8b},%0\n"
: /* no outputs */
: "m"(data)
: "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
  "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15");
}