Checksum code implementation for Neon in Intrinsics

Question

I'm trying to implement the checksum computation code(2's complement addition) for NEON, using intrinsic. The current checksum computation is being carried out on ARM.

My implementation fetches 128-bits at once from the memory into NEON registers and does SIMD (addition), and result is folded to a 16-bit number from a 128-bit number.

Everything looks to be working fine, but my NEON implementation is consuming more time that of the ARM version.

ARM version takes: 0.860000 s NEON version takes: 1.260000 s

Note:

Profiled using utilities from "time.h"
The checksum function called 10,000 times from a sample application, and time computed after complete run of all the functions

Other details:

Used GNU tool-chain(arm-none-linux-gnueabi-gcc) for compiling the intrinsic code and not arm tool-chain.
Linux platform.
C-intrinsic code.

Questions:

Why does NEON version take more time than that of the ARM version? (Although I have taken care that intrinsic with minimum cycles in the batch is used)
How do achieve what I want to achieve? (efficiency with NEON)
Could someone point to me or share some sample implementations(pseudo-code/algorithms/code, not the theoretical implementation papers or talks) which uses the inter-operations of ARM-NEON together?

Any help would be much appreciated.

Here's my code:

uint16_t do_csum(const unsigned char * buff, int len)
{
int odd, count, i;

uint32x4_t result = veorq_u32( result, result), sum = veorq_u32( sum, sum); 
uint16x4_t data, data_hi, data_low, data8;
uint16x8_t dataq;
uint16_t result16, disp[20] = {0,0,0,0,0,0,0,0,0,0};

if (len <= 0)
    goto out;
odd = 1 & (unsigned long) buff;
if (odd) {
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t)vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    data1 = (uint16x4_t)vshl_n_u16( data1, 8);

    len--;
    buff++;
    result = vaddw_u16(result, data1);
}
count = len >> 1;       /* nr of 16-bit words.. */
if (count) {
    if (2 & (unsigned long) buff) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        count--;
        len -= 2;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
    count >>= 1;        /* nr of 32-bit words.. */
    if (count) {
        if (4 & (unsigned long) buff) {
            uint32x2_t data4 = (uint16x4_t) vld1_lane_u32((uint32_t *) buff, data4, 0);
            count--;
            len -= 4;
            buff += 4;
            result = vaddw_u16( result, data4);
        }
        count >>= 1;    /* nr of 64-bit words.. */
        if (count) {
            if (8 & (unsigned long) buff) {
                uint64x1_t data8 = vld1_u64((uint64_t *) buff); 
                count--;
                len -= 8;
                buff += 8;
                result = vaddw_u16( result,(uint16x4_t)data8);
            }
            count >>= 1;    /* nr of 128-bit words.. */
            if (count) {
                do {
                    dataq = (uint16x8_t)vld1q_u64((uint64_t *) buff); // VLD1.64 {d0, d1}, [r0]
                    count--;
                    buff += 16;

                    sum = vpaddlq_u16(dataq);   
                    vst1q_u16( disp, dataq); // VST1.16 {d0, d1}, [r0]

                    result = vaddq_u32( sum, result);
                } while (count);
            }
            if (len & 8) {
                uint64x1_t data8 =  vld1_u64((uint64_t *) buff); 
                buff += 8;
                result = vaddw_u16( result, (uint16x4_t)data8);
            }
        }
        if (len & 4) {
            uint32x2_t data4 = veor_u32( data4, data4); 

            data4 = (uint16x4_t)vld1_lane_u32((uint32_t *) buff, data4, 0);//result += *(unsigned int *) buff;
            buff += 4;
            result = vaddw_u16( result,(uint16x4_t) data4);
        }
    }
    if (len & 2) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
}
if (len & 1){
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t) vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    result = vaddw_u8( result, data1);
}


result16 = from128to16(result);

if (odd)
    result16 = ((result16 >> 8) & 0xff) | ((result16 & 0xff) << 8);

out:
    return result16;
}

Show your code and I'll be able to tell you what's wrong with it. Are you using GCC? If so, I would recommend writing assembly language in a separate file or use inline asm since GCC doesn't do well with intrinsics. — BitBank, Aug 22 '12 at 05:50
@BitBank: Thanks, have edited my question to include the code, yes I'm using the cross compiler gcc. Using intrinsic, as I'm little unprepared to get into the shallow waters of assembly. — nguns, Aug 22 '12 at 06:03
What value of `len` are you using for testing ? Also, are you compiling with `-O3` ? — Paul R, Aug 22 '12 at 06:11
Thanks for the edit @Paul R, 1. The length is 2k bytes (data read from a file into an array into application then passed on to the do_sum function). 2. I'm using the following command for compiling: arm-none-linux-gnueabi-gcc -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions -c not using any levels(honestly, dint know anything related to levels). — nguns, Aug 22 '12 at 06:43
You **really** need `gcc -O3 ...` to enable compiler optimisations. — Paul R, Aug 22 '12 at 07:26

score 6 · Accepted Answer · edited May 23 '17 at 10:27

6

A few things you can improve:

Get rid of the stores to disp - this looks like debug code that got left in ?
Don't do horizontal addition within your main loop - just do partial (vertical) sums in the loop and do one final horizontal addition after the loop (see this answer for an example of how to do this - it's for SSE but the principle is the same)
Make sure you use gcc -O3 ... to get maximum benefit from compiler optimisation
Don't use goto ! (Doesn't affect performance but is evil.)

edited May 23 '17 at 10:27

Community

1
1

answered Aug 22 '12 at 06:16

Paul R

208,748
37
389
560

1. The disp code is indeed debug code, I'm commenting that out, got left out here, sorry about that. 2. Could you enlighten more on this? 3. Consider it done. – nguns Aug 22 '12 at 06:39
used the option suggested by you, :arm-none-linux-gnueabi-gcc -03 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions -c neonChecksum.c, but the compiler throws an error :arm-none-linux-gnueabi-gcc: unrecognized option '-03' – nguns Aug 22 '12 at 08:56
I'm sorry, did you mean O3(Alphabet O, numeric 3), it looks like 03(numeric 0, numeric 3), sorry for the above comment, now it compiled fine, will soon update you with my findings. – nguns Aug 22 '12 at 08:59
Whoa...!!! It works amazingly fast..!! The time now taken is: **0.050000 s**..!! **16X** better than ARM and **24X** better than the NEON code that wasn't optimised using the option -O3..!! Thanks @Paul R. I'm all set to accept this answer, also if you could answer my other questions listed in my main question. – nguns Aug 22 '12 at 09:38
Just as a NOTE: if I also optimise ARM code while compiling with -O3 option, then its **0.200000** s, which means NEON code(optimised with -O3) is only **4X** better the ARM code?? – nguns Aug 22 '12 at 10:16
1

4X is not bad - you can do better if you are prepared to spend a lot of time writing and hand optimsiing NEON asm, but if you can meet your performance goals with a fairly simple implementation using intrinsics as above then be happy with that. – Paul R Aug 22 '12 at 11:07
Note that on StackOverflow you should only ask one question per question - if you still have further questions then please post them as new questions. See the SO FAQ for more details on proper etiquette: http://stackoverflow.com/faq – Paul R Aug 22 '12 at 11:09
those questions are sub questions, related to this specific question. If I individually post them, they will be closed as open ended questions. – nguns Aug 22 '12 at 11:28

Checksum code implementation for Neon in Intrinsics

1 Answers1

Linked