
I'm trying to implement checksum computation code (2's complement addition) for NEON, using intrinsics. The current checksum computation is carried out on ARM.

My implementation fetches 128 bits at a time from memory into NEON registers, performs SIMD addition, and folds the 128-bit result down to a 16-bit number.
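The folding step described above can be sketched in plain C as follows. This is only an illustration under stated assumptions: `fold_to_16` and `fold_lanes_to_16` are hypothetical names, and the code is not the `from128to16` helper called in the listing below (which is not shown in the question). It assumes the 128-bit result is held as four 32-bit partial sums and that carries out of the low 16 bits are added back in (end-around carry).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold a wide sum down to 16 bits by repeatedly
 * adding the bits above bit 15 back into the low 16 bits. */
static uint16_t fold_to_16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical helper: treat a 128-bit accumulator as four 32-bit lanes,
 * add the lanes together (the horizontal step), then fold to 16 bits. */
static uint16_t fold_lanes_to_16(const uint32_t lanes[4])
{
    uint64_t total = 0;

    assert(lanes != NULL);
    for (size_t i = 0; i < 4; i++)
        total += lanes[i];
    return fold_to_16(total);
}
```

With NEON, the four lanes would typically be written out with `vst1q_u32` (or summed with pairwise adds) before this scalar fold runs once per buffer.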

Everything appears to work correctly, but my NEON implementation takes more time than the ARM version.

ARM version takes 0.860000 s; NEON version takes 1.260000 s.

Note:

  1. Profiled using utilities from "time.h".
  2. The checksum function was called 10,000 times from a sample application, and the time was measured after all the calls had completed.
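A measurement of this kind might look roughly like the sketch below. It is an assumption, not the actual test application: the question only says that utilities from "time.h" were used, so the choice of `clock()` is a guess, and `time_calls`, `csum_fn`, `csum_sink` and `byte_sum_stub` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Same signature as do_csum() in the question. */
typedef uint16_t (*csum_fn)(const unsigned char *buff, int len);

/* Writing each result to a volatile keeps an optimising compiler from
 * discarding the calls being timed. */
static volatile uint16_t csum_sink;

/* Hypothetical stand-in for the function under test: a plain byte sum. */
static uint16_t byte_sum_stub(const unsigned char *buff, int len)
{
    uint16_t s = 0;
    for (int i = 0; i < len; i++)
        s = (uint16_t)(s + buff[i]);
    return s;
}

/* Call fn `iterations` times and return the CPU time used, in seconds. */
static double time_calls(csum_fn fn, const unsigned char *buff, int len,
                         int iterations)
{
    assert(fn != NULL);
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        csum_sink = fn(buff, len);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For example, `time_calls(do_csum, data, 2048, 10000)` would time 10,000 calls on a 2 KB buffer. Note that `clock()` reports processor time rather than wall-clock time.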

Other details:

  1. Used the GNU toolchain (arm-none-linux-gnueabi-gcc) to compile the intrinsic code, not the ARM toolchain.
  2. Linux platform.
  3. C-intrinsic code.

Questions:

  1. Why does the NEON version take more time than the ARM version? (I have taken care to use the intrinsics with the fewest cycles wherever there was a choice.)

  2. How do I achieve what I want to achieve (efficiency with NEON)?

  3. Could someone point me to, or share, some sample implementations (pseudo-code, algorithms, or code, not theoretical papers or talks) that use ARM and NEON operations together?

Any help would be much appreciated.

Here's my code:

uint16_t do_csum(const unsigned char * buff, int len)
{
int odd, count, i;

uint32x4_t result = veorq_u32( result, result), sum = veorq_u32( sum, sum); 
uint16x4_t data, data_hi, data_low, data8;
uint16x8_t dataq;
uint16_t result16, disp[20] = {0,0,0,0,0,0,0,0,0,0};

if (len <= 0)
    goto out;
odd = 1 & (unsigned long) buff;
if (odd) {
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t)vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    data1 = (uint16x4_t)vshl_n_u16( data1, 8);

    len--;
    buff++;
    result = vaddw_u16(result, data1);
}
count = len >> 1;       /* nr of 16-bit words.. */
if (count) {
    if (2 & (unsigned long) buff) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        count--;
        len -= 2;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
    count >>= 1;        /* nr of 32-bit words.. */
    if (count) {
        if (4 & (unsigned long) buff) {
            uint32x2_t data4 = (uint16x4_t) vld1_lane_u32((uint32_t *) buff, data4, 0);
            count--;
            len -= 4;
            buff += 4;
            result = vaddw_u16( result, data4);
        }
        count >>= 1;    /* nr of 64-bit words.. */
        if (count) {
            if (8 & (unsigned long) buff) {
                uint64x1_t data8 = vld1_u64((uint64_t *) buff); 
                count--;
                len -= 8;
                buff += 8;
                result = vaddw_u16( result,(uint16x4_t)data8);
            }
            count >>= 1;    /* nr of 128-bit words.. */
            if (count) {
                do {
                    dataq = (uint16x8_t)vld1q_u64((uint64_t *) buff); // VLD1.64 {d0, d1}, [r0]
                    count--;
                    buff += 16;

                    sum = vpaddlq_u16(dataq);   
                    vst1q_u16( disp, dataq); // VST1.16 {d0, d1}, [r0]

                    result = vaddq_u32( sum, result);
                } while (count);
            }
            if (len & 8) {
                uint64x1_t data8 =  vld1_u64((uint64_t *) buff); 
                buff += 8;
                result = vaddw_u16( result, (uint16x4_t)data8);
            }
        }
        if (len & 4) {
            uint32x2_t data4 = veor_u32( data4, data4); 

            data4 = (uint16x4_t)vld1_lane_u32((uint32_t *) buff, data4, 0);//result += *(unsigned int *) buff;
            buff += 4;
            result = vaddw_u16( result,(uint16x4_t) data4);
        }
    }
    if (len & 2) {
        uint16x4_t data2 = veor_u16( data2, data2); 
        data2 = (uint16x4_t) vld1_lane_u16((uint16_t *)buff, data2, 0); //result += *(unsigned short *) buff;
        buff += 2;
        result = vaddw_u16( result, data2);
    }
}
if (len & 1){
    uint8x8_t data1 = veor_u8( data1, data1); 
    data1 = (uint16x4_t) vld1_lane_u8((uint8_t *)buff, data1, 0); //result = *buff << 8;
    result = vaddw_u8( result, data1);
}


result16 = from128to16(result);

if (odd)
    result16 = ((result16 >> 8) & 0xff) | ((result16 & 0xff) << 8);

out:
    return result16;
}
nguns
  • Show your code and I'll be able to tell you what's wrong with it. Are you using GCC? If so, I would recommend writing assembly language in a separate file or use inline asm since GCC doesn't do well with intrinsics. – BitBank Aug 22 '12 at 05:50
  • @BitBank: Thanks, have edited my question to include the code, yes I'm using the cross compiler gcc. Using intrinsic, as I'm little unprepared to get into the shallow waters of assembly. – nguns Aug 22 '12 at 06:03
  • What value of `len` are you using for testing ? Also, are you compiling with `-O3` ? – Paul R Aug 22 '12 at 06:11
  • Thanks for the edit @Paul R. 1. The length is 2k bytes (data read from a file into an array in the application, then passed on to the do_csum function). 2. I'm using the following command for compiling: `arm-none-linux-gnueabi-gcc -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions -c`. I'm not using any optimisation levels (honestly, I didn't know anything about levels). – nguns Aug 22 '12 at 06:43
  • You **really** need `gcc -O3 ...` to enable compiler optimisations. – Paul R Aug 22 '12 at 07:26

1 Answer


A few things you can improve:

  • Get rid of the stores to `disp`; this looks like debug code that got left in.
  • Don't do horizontal addition within your main loop. Just do partial (vertical) sums in the loop and do one final horizontal addition after the loop (see this answer for an example of how to do this; it's for SSE but the principle is the same).
  • Make sure you use `gcc -O3 ...` to get maximum benefit from compiler optimisation.
  • Don't use `goto`! (It doesn't affect performance but is evil.)
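
The second point can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in replacement for `do_csum`: the names `csum_blocks16` and `fold_to_16` are hypothetical, the buffer length is assumed to be a multiple of 16 bytes and at most 512 KiB (so the 32-bit lanes cannot overflow), and alignment and tail handling are left out. When NEON is not available at compile time, a scalar fallback with the same lane-wise structure is used so that the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: fold a wide sum to 16 bits with end-around carry. */
static uint16_t fold_to_16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum native-order 16-bit words over whole 16-byte blocks. */
static uint16_t csum_blocks16(const uint8_t *buf, size_t len)
{
    uint64_t total = 0;

    assert(buf != NULL && len % 16 == 0 && len <= 512u * 1024u);
#if HAVE_NEON
    uint32x4_t acc_lo = vdupq_n_u32(0);   /* explicit zero initialisation */
    uint32x4_t acc_hi = vdupq_n_u32(0);
    for (size_t i = 0; i < len; i += 16) {
        uint16x8_t words = vreinterpretq_u16_u8(vld1q_u8(buf + i));
        /* Vertical, widening adds only: lane k accumulates word k. */
        acc_lo = vaddw_u16(acc_lo, vget_low_u16(words));
        acc_hi = vaddw_u16(acc_hi, vget_high_u16(words));
    }
    /* One horizontal reduction, after the loop. */
    uint64x2_t pairs = vpaddlq_u32(vaddq_u32(acc_lo, acc_hi));
    total = vgetq_lane_u64(pairs, 0) + vgetq_lane_u64(pairs, 1);
#else
    uint32_t lanes[8] = {0};              /* scalar stand-in for the lanes */
    for (size_t i = 0; i < len; i += 16) {
        uint16_t words[8];
        memcpy(words, buf + i, sizeof words);
        for (size_t k = 0; k < 8; k++)
            lanes[k] += words[k];
    }
    for (size_t k = 0; k < 8; k++)
        total += lanes[k];
#endif
    return fold_to_16(total);
}
```

The loop body contains only a load and two widening adds; the cross-lane work and the fold to 16 bits happen once per buffer rather than once per 16 bytes.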
Paul R
  • 1. The disp code is indeed debug code, I'm commenting that out, got left out here, sorry about that. 2. Could you enlighten more on this? 3. Consider it done. – nguns Aug 22 '12 at 06:39
  • used the option suggested by you, :arm-none-linux-gnueabi-gcc -03 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions -c neonChecksum.c, but the compiler throws an error :arm-none-linux-gnueabi-gcc: unrecognized option '-03' – nguns Aug 22 '12 at 08:56
  • I'm sorry, did you mean O3(Alphabet O, numeric 3), it looks like 03(numeric 0, numeric 3), sorry for the above comment, now it compiled fine, will soon update you with my findings. – nguns Aug 22 '12 at 08:59
  • Whoa...!!! It works amazingly fast..!! The time now taken is: **0.050000 s**..!! **16X** better than ARM and **24X** better than the NEON code that wasn't optimised using the option -O3..!! Thanks @Paul R. I'm all set to accept this answer, also if you could answer my other questions listed in my main question. – nguns Aug 22 '12 at 09:38
  • Just as a NOTE: if I also optimise the ARM code by compiling with the -O3 option, then it takes **0.200000** s, which means the NEON code (optimised with -O3) is only **4X** better than the ARM code?? – nguns Aug 22 '12 at 10:16
  • 4X is not bad - you can do better if you are prepared to spend a lot of time writing and hand optimising NEON asm, but if you can meet your performance goals with a fairly simple implementation using intrinsics as above then be happy with that. – Paul R Aug 22 '12 at 11:07
  • Note that on StackOverflow you should only ask one question per question - if you still have further questions then please post them as new questions. See the SO FAQ for more details on proper etiquette: http://stackoverflow.com/faq – Paul R Aug 22 '12 at 11:09
  • those questions are sub questions, related to this specific question. If I individually post them, they will be closed as open ended questions. – nguns Aug 22 '12 at 11:28