
I have written ARM v7 assembly code for the C function below, but our assembly code is taking more time than the C code. Can anyone tell me the reason?

#include <stdlib.h> // for abs()

int get_maximum_sample_value (short int *inp_frame, int frame_size) {
    short int *temp_buff = inp_frame; // Holds the local pointer.

    int maximum_value = -1000; // Holds the maximum value.
    int abs_value     = 0;     // Holds the absolute value.

    // Get the maximum value of the frame.
    for (int index = 0; index < frame_size; ++index) {

        abs_value = abs(*temp_buff);

        if (maximum_value < abs_value) {
            maximum_value = abs_value;
        }
        ++temp_buff;
    }

    return maximum_value;
}

asm:

.cfi_startproc

push    {r4}

ldr     r4, LC_P1000          // LC_P1000 = -1000, initial maximum
vdup.s32 q2, r4               // q2 = running maximum vector
cmp     r1, #0                // frame_size == 0?
beq     LP_VD_END

lsrs    r4, r1, #2            // r4 = frame_size / 4 = loop count
beq     LP_VD_END

LP_VD1:

vldm    r0, {d0}              // load 4 x s16 samples
add     r0, #8
vmovl.s16 q1, d0              // widen to 4 x s32

vabs.s32 q1, q1               // absolute values
subs    r4, r4, #1
vmax.s32 q2, q1, q2           // update running maximum
bne     LP_VD1
vmax.s32 d4, d5, d4           // reduce q2 to 2 lanes in d4

vmov    r0, s8                // extract the 2 remaining lanes
vmov    r2, s9
cmp     r0, r2
it      lt
movlt   r0, r2                // r0 = final maximum

LP_VD_END:
pop     {r4}
bx      lr
.cfi_endproc
ravi
  • I haven't really read your code but one thing to keep in mind: Compilers are "much" better at optimizing code than us humans – lakshayg Feb 16 '16 at 07:51
    You should make the compiler output assembler source code (every C-compiler I've tried has that option) then you can compare it with your asm-version. – Ville Krumlinde Feb 16 '16 at 08:36
  • What did the compiler do badly that you see you can do better? It's often possible to beat the compiler, esp if you know the target microarchitecture and how to optimize for it. (e.g. for x86, http://agner.org/optimize/ microarchitecture guide). Often `gcc -O3` output is a good starting point, though. I wouldn't be surprised if your compiler already did a decent job auto-vectorizing this function, so there might not be much to gain. Sometimes you can only cut out one or two instructions, or tighten up the branching layout. – Peter Cordes Feb 16 '16 at 09:17
  • Neither optimized C nor hand-written asm can be assumed to be "faster"; the answer is always "it depends". A compiler on a large project can outperform the human unless the human wants to spend a lot of time at it. For a real-sized project it is trivial to find optimized C code that can be improved by hand by a human. Point being, never assume one is "better" than the other. – old_timer Feb 16 '16 at 15:32
  • Also see [Is inline assembly language slower than native C++ code?](http://stackoverflow.com/questions/9601427/is-inline-assembly-language-slower-than-native-c-code) – Bo Persson Feb 16 '16 at 20:59

1 Answer


It's quite hard to tell why the handwritten assembly is slower than the C without seeing the compiler output or knowing whether the compiler auto-vectorizes, etc. However, it's easy to tell why the assembly code is (very) slow:

  • NEON SIMD instructions have long latency and high throughput. By using only one maximum_value vector, you have serialized an originally parallel problem: every vector operation depends on the result of the previous instruction, forcing it to wait out the whole ~4-cycle latency before it can execute. The problem is even worse on cores with in-order SIMD execution pipelines (everything except the newer "big" Cortex-A cores A9, A15, A57, A72 and some Apple designs).
  • If the input array is large and not already in the caches, the code is limited by waiting for memory operations to complete. Some ARM processors have hardware L2 prefetchers, but even on those, prefetching the data in software can speed up the loop several times over.

A fast implementation written in NEON intrinsics might look like this:

#include <arm_neon.h> // NEON intrinsic types and functions
#include <stdint.h>   // int16_t, int32_t, INT16_MIN

int16_t* buf = inp_frame;

// These variables hold the absolute values during the loop.
// Must use 32-bit values because abs(INT16_MIN) doesn't fit in 16-bit signed int.
int32x4_t max0 = vmovq_n_s32(INT16_MIN);
int32x4_t max1 = vmovq_n_s32(INT16_MIN);
int32x4_t max2 = vmovq_n_s32(INT16_MIN);
int32x4_t max3 = vmovq_n_s32(INT16_MIN);
int32x4_t max4 = vmovq_n_s32(INT16_MIN);
int32x4_t max5 = vmovq_n_s32(INT16_MIN);
int32x4_t max6 = vmovq_n_s32(INT16_MIN);
int32x4_t max7 = vmovq_n_s32(INT16_MIN);

// Process 32 values = 64 bytes per iteration.
for(int i = frame_size / 32; i != 0; i--)
{
    // Prefetch data 8 cache lines (512 bytes) ahead; the optimal distance depends on the hardware.
    __prefetch((int8_t*)buf + 8 * 64); // use whatever prefetch intrinsic your compiler provides (e.g. __builtin_prefetch on GCC/Clang)

    int16x8_t val0 = vld1q_s16(buf);
    int16x8_t val1 = vld1q_s16(buf + 8);
    int16x8_t val2 = vld1q_s16(buf + 16);
    int16x8_t val3 = vld1q_s16(buf + 24);
    buf += 32;

    // Widen the values before taking abs.
    int32x4_t vall0 = vmovl_s16(vget_low_s16(val0));
    int32x4_t vall1 = vmovl_s16(vget_high_s16(val0));
    int32x4_t vall2 = vmovl_s16(vget_low_s16(val1));
    int32x4_t vall3 = vmovl_s16(vget_high_s16(val1));
    int32x4_t vall4 = vmovl_s16(vget_low_s16(val2));
    int32x4_t vall5 = vmovl_s16(vget_high_s16(val2));
    int32x4_t vall6 = vmovl_s16(vget_low_s16(val3));
    int32x4_t vall7 = vmovl_s16(vget_high_s16(val3));

    int32x4_t abs_vall0 = vabsq_s32(vall0);
    int32x4_t abs_vall1 = vabsq_s32(vall1);
    int32x4_t abs_vall2 = vabsq_s32(vall2);
    int32x4_t abs_vall3 = vabsq_s32(vall3);
    int32x4_t abs_vall4 = vabsq_s32(vall4);
    int32x4_t abs_vall5 = vabsq_s32(vall5);
    int32x4_t abs_vall6 = vabsq_s32(vall6);
    int32x4_t abs_vall7 = vabsq_s32(vall7);

    max0 = vmaxq_s32(max0, abs_vall0);
    max1 = vmaxq_s32(max1, abs_vall1);
    max2 = vmaxq_s32(max2, abs_vall2);
    max3 = vmaxq_s32(max3, abs_vall3);
    max4 = vmaxq_s32(max4, abs_vall4);
    max5 = vmaxq_s32(max5, abs_vall5);
    max6 = vmaxq_s32(max6, abs_vall6);
    max7 = vmaxq_s32(max7, abs_vall7);
}

// Reduce the maximum value to a single one.
int32x4_t max01 = vmaxq_s32(max0, max1);
int32x4_t max23 = vmaxq_s32(max2, max3);
int32x4_t max45 = vmaxq_s32(max4, max5);
int32x4_t max67 = vmaxq_s32(max6, max7);

int32x4_t max0123 = vmaxq_s32(max01, max23);
int32x4_t max4567 = vmaxq_s32(max45, max67);
int32x4_t qmax = vmaxq_s32(max0123, max4567);

// Horizontal max inside q-register.
int32x2_t dmax = vmax_s32(vget_low_s32(qmax), vget_high_s32(qmax));
int32_t max_value = vget_lane_s32(vpmax_s32(dmax, dmax), 0);

// TODO: process the remaining frame_size % 32 samples here (see the scalar tail sketch below)

This kind of interleaving produces lots of instruction-level parallelism, allowing the core to execute instructions every cycle instead of stalling on data dependencies. 8-way interleaving/unrolling is enough to keep busy even the fastest Cortex-A72, which can execute two of these 3-cycle-latency NEON ALU instructions per clock. Note that the code uses all 16 architectural q-registers available, so you may want to check that the compiler doesn't spill any of them to the stack (not all compilers handle this situation well).
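If frame_size isn't a multiple of 32, the samples left over by the vector loop still have to be processed (the TODO above). A minimal scalar tail sketch, assuming the buf, frame_size and max_value variables from the snippet above (buf already points just past the last vector-processed sample):

// Scalar tail: handle the frame_size % 32 samples the vector loop didn't touch.
for (int i = frame_size % 32; i != 0; i--)
{
    int32_t v = *buf++;   // widen to 32 bits so the abs of INT16_MIN is safe
    if (v < 0)
        v = -v;
    if (v > max_value)
        max_value = v;
}

The tail runs at most 31 iterations, so there is no point in vectorizing it.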

Henri Ylitie