I already asked on the Arm developer forums, but got no answers: https://community.arm.com/developer/tools-software/hpc/f/hpc-user-group/45524/any-neon-instructions-can-be-dual-issued-with-vector-long-multiply-accumulate-smlal-on-cortex-a53-or-a55
Does anyone know if this is possible? I've tried several instruction mixes:
                                 A53    A55   (instructions/cycle)
smlal.int8 & 64-bit load (ld1)   0.97   1.72
smlal.int8 & 64-bit dup          0.97   0.96
smlal.int8 & 64-bit orr          0.97   0.96
mla.int8   & 64-bit dup          1.79   1.74
mla.int16  & 64-bit dup          0.97   0.95
The Cortex-A55 software optimization guide has contradictory info: https://static.docs.arm.com/epm128372/20/arm_cortex_a55_software_optimization_guide_v2.pdf
In one place it lists smlal with dual issue = 01, which in my understanding means it can only issue in slot 0, but allows another SIMD instruction to issue in slot 1.
In another place it lists dual issue = 00, which means dual issue is prevented entirely. Does that mean it occupies both slots?
What's even more surprising is that even a simple SIMD orr can't execute in parallel with it. What I really want to do is dual-issue a dup (of a specific lane) with smlal.
What's going on? Is smlal using both SIMD units?
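For concreteness, the pairing I'd like to achieve looks like this (a sketch; the register choices and lane indices are arbitrary):

```asm
// hoped-for schedule: each dup broadcasts the next lane while an
// independent smlal issues in the same cycle alongside it
dup    v5.8b,  v1.b[0]
smlal  v16.8h, v0.8b, v5.8b
dup    v6.8b,  v1.b[1]
smlal  v17.8h, v0.8b, v6.8b
```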
One suggestion was that it could be a register-file bandwidth bottleneck (see: Can Cortex-A57 dual-issue 128-bit NEON instructions?).
That makes sense, because unlike almost every other instruction, smlal reads 3 operands and writes an output that's 2x as wide as its sources, while a regular mla does 3 reads and 1 same-width write. And it would be very unreasonable to add a 3rd write port to the NEON register file just to support writing smlal's 128-bit output alongside another 64-bit output from a 2nd instruction.
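Spelling out the operand traffic per instruction (my reading of the ISA; worth double-checking against the Arm ARM):

```asm
mla    v2.8b, v0.8b, v1.8b   // 3 reads of 64 bits (v0, v1, and the
                             // accumulator v2), 1 write of 64 bits
smlal  v2.8h, v0.8b, v1.8b   // reads two 64-bit sources plus the
                             // 128-bit accumulator v2.8h, then
                             // writes the full 128 bits back
```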
Another explanation is that some instructions can't be dual-issued every cycle (e.g. "vector load cannot be issued on the 4th cycle of each fmla"):
https://github.com/Tencent/ncnn/wiki/arm-a53-a55-dual-issue
// Throughput test: 16 independent smlal/ld1 pairs per iteration.
// Registers rotate so each ld1 overwrites a register only after the
// smlal that reads it has already issued.
uint8_t data[32] __attribute__((aligned(32)));
for (int i = 0; i < N; i += 16)
{
asm volatile(
"smlal v9.8h, v0.8b, v0.8b\n"
"ld1 {v0.8b},%0\n"
"smlal v10.8h, v1.8b, v1.8b\n"
"ld1 {v1.8b},%0\n"
"smlal v11.8h, v2.8b, v2.8b\n"
"ld1 {v2.8b},%0\n"
"smlal v12.8h, v3.8b, v3.8b\n"
"ld1 {v3.8b},%0\n"
"smlal v13.8h, v4.8b, v4.8b\n"
"ld1 {v4.8b},%0\n"
"smlal v14.8h, v5.8b, v5.8b\n"
"ld1 {v5.8b},%0\n"
"smlal v15.8h, v6.8b, v6.8b\n"
"ld1 {v6.8b},%0\n"
"smlal v0.8h, v7.8b, v7.8b\n"
"ld1 {v7.8b},%0\n"
"smlal v1.8h, v8.8b, v8.8b\n"
"ld1 {v8.8b},%0\n"
"smlal v2.8h, v9.8b, v9.8b\n"
"ld1 {v9.8b},%0\n"
"smlal v3.8h, v10.8b, v10.8b\n"
"ld1 {v10.8b},%0\n"
"smlal v4.8h, v11.8b, v11.8b\n"
"ld1 {v11.8b},%0\n"
"smlal v5.8h, v12.8b, v12.8b\n"
"ld1 {v12.8b},%0\n"
"smlal v6.8h, v13.8b, v13.8b\n"
"ld1 {v13.8b},%0\n"
"smlal v7.8h, v14.8b, v14.8b\n"
"ld1 {v14.8b},%0\n"
"smlal v8.8h, v15.8b, v15.8b\n"
"ld1 {v15.8b},%0\n"
: /* no outputs */
: "m"(data)
: "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
  "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15");
}