It's quite hard to tell why the handwritten assembly is slower than the C version without seeing the compiler output and without knowing whether the compiler auto-vectorizes the loop. However, it's easy to tell why the assembly code is (very) slow:
- NEON SIMD instructions have long latency but high throughput. By using only one maximum_value vector you have serialized an inherently parallel problem: every vector operation depends on the result of the previous instruction, so each one has to wait out the whole ~4-cycle latency before it can execute. The problem is even worse on cores with in-order SIMD execution pipelines (everything except the newer "big" Cortex-A cores such as the A9, A15, A57 and A72, and some Apple designs).
- If the input array is large and not already in the caches, the code is limited by waiting for memory operations to complete. Some ARM processors have a hardware L2 prefetcher, but even on those, prefetching the data in software can speed up the loop many times over.
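For comparison, the plain C version of this reduction is just a scalar loop (a rough sketch, reusing the same inp_frame/frame_size names as the intrinsics code below; the variable names p, v and max_value are mine), and it is the kind of loop a compiler may or may not auto-vectorize well:
int16_t* p = inp_frame;
int32_t max_value = 0;
for (int i = 0; i < frame_size; i++)
{
    int32_t v = p[i]; // widen before abs so abs(INT16_MIN) doesn't overflow
    if (v < 0) v = -v;
    if (v > max_value) max_value = v;
}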
A fast implementation written with NEON intrinsics (from arm_neon.h) might look like this:
int16_t* buf = inp_frame;
// These vectors hold the running maxima of the absolute values during the loop.
// 32-bit lanes are needed because abs(INT16_MIN) = 32768 doesn't fit in a signed 16-bit int.
int32x4_t max0 = vmovq_n_s32(INT16_MIN);
int32x4_t max1 = vmovq_n_s32(INT16_MIN);
int32x4_t max2 = vmovq_n_s32(INT16_MIN);
int32x4_t max3 = vmovq_n_s32(INT16_MIN);
int32x4_t max4 = vmovq_n_s32(INT16_MIN);
int32x4_t max5 = vmovq_n_s32(INT16_MIN);
int32x4_t max6 = vmovq_n_s32(INT16_MIN);
int32x4_t max7 = vmovq_n_s32(INT16_MIN);
// Process 32 values = 64 bytes per iteration.
for(int i = frame_size / 32; i != 0; i--)
{
    // Prefetch data 8 64-byte cache lines ahead (or 16; the optimal distance depends on the hardware).
    __builtin_prefetch((int8_t*)buf + 8 * 64); // GCC/Clang builtin; use whatever prefetch intrinsic your compiler provides
    int16x8_t val0 = vld1q_s16(buf);
    int16x8_t val1 = vld1q_s16(buf + 8);
    int16x8_t val2 = vld1q_s16(buf + 16);
    int16x8_t val3 = vld1q_s16(buf + 24);
    buf += 32;
    // Widen the values before taking abs.
    int32x4_t vall0 = vmovl_s16(vget_low_s16(val0));
    int32x4_t vall1 = vmovl_s16(vget_high_s16(val0));
    int32x4_t vall2 = vmovl_s16(vget_low_s16(val1));
    int32x4_t vall3 = vmovl_s16(vget_high_s16(val1));
    int32x4_t vall4 = vmovl_s16(vget_low_s16(val2));
    int32x4_t vall5 = vmovl_s16(vget_high_s16(val2));
    int32x4_t vall6 = vmovl_s16(vget_low_s16(val3));
    int32x4_t vall7 = vmovl_s16(vget_high_s16(val3));
    int32x4_t abs_vall0 = vabsq_s32(vall0);
    int32x4_t abs_vall1 = vabsq_s32(vall1);
    int32x4_t abs_vall2 = vabsq_s32(vall2);
    int32x4_t abs_vall3 = vabsq_s32(vall3);
    int32x4_t abs_vall4 = vabsq_s32(vall4);
    int32x4_t abs_vall5 = vabsq_s32(vall5);
    int32x4_t abs_vall6 = vabsq_s32(vall6);
    int32x4_t abs_vall7 = vabsq_s32(vall7);
    max0 = vmaxq_s32(max0, abs_vall0);
    max1 = vmaxq_s32(max1, abs_vall1);
    max2 = vmaxq_s32(max2, abs_vall2);
    max3 = vmaxq_s32(max3, abs_vall3);
    max4 = vmaxq_s32(max4, abs_vall4);
    max5 = vmaxq_s32(max5, abs_vall5);
    max6 = vmaxq_s32(max6, abs_vall6);
    max7 = vmaxq_s32(max7, abs_vall7);
}
// Reduce the 8 partial maximum vectors to a single value.
int32x4_t max01 = vmaxq_s32(max0, max1);
int32x4_t max23 = vmaxq_s32(max2, max3);
int32x4_t max45 = vmaxq_s32(max4, max5);
int32x4_t max67 = vmaxq_s32(max6, max7);
int32x4_t max0123 = vmaxq_s32(max01, max23);
int32x4_t max4567 = vmaxq_s32(max45, max67);
int32x4_t qmax = vmaxq_s32(max0123, max4567);
// Horizontal max inside q-register.
int32x2_t dmax = vmax_s32(vget_low_s32(qmax), vget_high_s32(qmax));
int32_t max_value = vget_lane_s32(vpmax_s32(dmax, dmax), 0);
// Process any remaining items (when frame_size isn't a multiple of 32), for example:
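// A scalar tail loop (a sketch, not part of the original answer) for the
// 0..31 elements left over after the vectorized loop:
for (int i = frame_size & ~31; i < frame_size; i++)
{
    int32_t v = inp_frame[i]; // widen before abs so abs(INT16_MIN) doesn't overflow
    if (v < 0) v = -v;
    if (v > max_value) max_value = v;
}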
This kind of interleaving produces lots of instruction-level parallelism, allowing the core to execute instructions every cycle instead of stalling on data dependencies. 8-way interleaving/unrolling is enough to keep even the fastest cores busy: a Cortex-A72 can execute two of these 3-cycle-latency NEON ALU instructions per clock. Note that the code uses essentially all of the 16 architectural q-registers available, so you may want to check that the compiler doesn't spill any of them to the stack (not all compilers handle this situation well).