
I am new to SSE2 instructions. I have found an instruction _mm_add_epi8 which can add two array elements. But I want an SSE instruction which can add all elements of an array.

I was trying to develop this concept using this code:

#include <iostream>
#include <conio.h>
#include <emmintrin.h>

void sse(unsigned char* a,unsigned char* b); 

void main()
{
    /*unsigned char *arr;
    arr=(unsigned char *)malloc(50);*/

    unsigned char arr[]={'a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r','a','b','c','d','e','f','i','j','k','l','m','n','o','p','q','r'};
    unsigned char *next_arr=arr+16;
    for(int i=0;i<16;i++)
          printf("%d,%c   ",next_arr[i],next_arr[i]);
    sse(arr,next_arr);

    getch();
}

void sse(unsigned char* a,unsigned char* b)
{
  __m128i* l = (__m128i*)a;
  __m128i* r = (__m128i*)b;
  __m128i result;

      result= _mm_add_epi8(*l, *r);

      unsigned char *p;
         p=(unsigned char *)&result;

        for(int i=0;i<16;i++)
          printf("%d ",p[i]);

         printf("\n");
         l=(__m128i*)p;
         r=(__m128i*)(p+8);         
         result=_mm_add_epi8(*l, *r);
         p=(unsigned char *)&result;
         printf("%d ",p[0]);

         l=(__m128i*)p;
         r=(__m128i*)(p+4);
         result=_mm_add_epi8(*l, *r);
         p=(unsigned char *)&result;
         l=(__m128i*)p;
         r=(__m128i*)(p+2);
         result=_mm_add_epi8(*l, *r);
         p=(unsigned char *)&result;
         l=(__m128i*)p;
         r=(__m128i*)(p+1);
         result=_mm_add_epi8(*l, *r);
          p=(unsigned char *)&result;
            printf("result =%d ",p[0]);
}

So can anybody please tell me how to add all the elements of an array using SSE2 instructions?

Any help will be appreciated.

Paul R
geeta
  • Closed as duplicate because `psadbw` is *significantly* more efficient for summing 8-bit elements without overflow, and the answer there uses that. Use it with `paddd` or `paddq` for big arrays. – Peter Cordes Nov 07 '17 at 16:59

1 Answer


If you just want to sum all the elements of an array then you need to load the data, unpack it to a wider element size, and then sum the unpacked elements. Note that you can maintain multiple partial sums until after the loop and then just do one final sum of these partial sums. For example:

#include <stdint.h>
#include <emmintrin.h>                          // SSE2 intrinsics

uint32_t sum_array(const uint8_t a[], int n)
{
    const __m128i vk0 = _mm_set1_epi8(0);       // constant vector of all 0s for use with _mm_unpacklo_epi8/_mm_unpackhi_epi8
    const __m128i vk1 = _mm_set1_epi16(1);      // constant vector of all 1s for use with _mm_madd_epi16
    __m128i vsum = _mm_set1_epi32(0);           // initialise vector of four partial 32 bit sums
    uint32_t sum;
    int i;

    for (i = 0; i < n; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&a[i]); // load vector of 8 bit values (aligned load)
        __m128i vl = _mm_unpacklo_epi8(v, vk0); // unpack to two vectors of 16 bit values
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
                                                // widen and accumulate 16 bit values into
                                                // 32 bit partial sum vector
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = (uint32_t)_mm_cvtsi128_si32(vsum);
    return sum;
}

Note that there is one non-obvious trick in the above code - rather than further unpacking each 16 bit vector to a pair of 32 bit vectors (requiring 4 unpack instructions) and then using four 32 bit adds (another 4 instructions), we use _mm_madd_epi16 (PMADDWD) with a multiplicand of 1 and _mm_add_epi32 to effectively give us free unpacking, so we get the same result using 4 instructions instead of 8.
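To make the trick a little more concrete, here is a minimal sketch that widens eight 16 bit values to four 32 bit sums in both ways. The function names pair_sums_madd and lane_sums_unpack are hypothetical, and the sketch assumes non-negative inputs (as with zero-extended bytes), since _mm_madd_epi16 treats its operands as signed:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Route 1: PMADDWD with a vector of 1s.
   out[k] = in[2k] * 1 + in[2k + 1] * 1, already widened to 32 bits. */
void pair_sums_madd(const int16_t in[8], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i s = _mm_madd_epi16(v, _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)out, s);
}

/* Route 2: explicit zero-extension with two unpacks, then a 32 bit add.
   out[k] = in[k] + in[k + 4]. Assumes the inputs are non-negative. */
void lane_sums_unpack(const int16_t in[8], int32_t out[4])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i v  = _mm_loadu_si128((const __m128i *)in);
    __m128i lo = _mm_unpacklo_epi16(v, zero);   /* in[0..3] as 32 bit lanes */
    __m128i hi = _mm_unpackhi_epi16(v, zero);   /* in[4..7] as 32 bit lanes */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(lo, hi));
}
```

The two routes pair the elements differently, so the individual lanes differ, but the total over the four lanes is the same, which is all that matters for a sum reduction.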

Note also that the input array, a[], needs to be 16 byte aligned, and n should be a multiple of 16.
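If those two constraints are inconvenient, one possible way to relax them is sketched below. This is an illustrative variant rather than part of the original code: the name sum_array_unaligned is hypothetical, it swaps the aligned load for _mm_loadu_si128, and it finishes any leftover bytes with a plain scalar loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant: accepts any pointer alignment and any length n. */
uint32_t sum_array_unaligned(const uint8_t *a, size_t n)
{
    const __m128i vk0 = _mm_setzero_si128();    /* zeros for unpacking */
    const __m128i vk1 = _mm_set1_epi16(1);      /* ones for _mm_madd_epi16 */
    __m128i vsum = _mm_setzero_si128();         /* four 32 bit partial sums */
    uint32_t sum;
    size_t i = 0;

    /* Vector part: process whole 16 byte blocks with unaligned loads. */
    for (; i + 16 <= n; i += 16)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vl = _mm_unpacklo_epi8(v, vk0);
        __m128i vh = _mm_unpackhi_epi8(v, vk0);
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vl, vk1));
        vsum = _mm_add_epi32(vsum, _mm_madd_epi16(vh, vk1));
    }

    /* Horizontal add of the four partial sums. */
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    /* Scalar tail: the remaining n % 16 bytes. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Unaligned loads may be slower than aligned ones on older processors, so keeping the aligned version for buffers you control is still a reasonable choice.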

Paul R
  • Thanks for the reply. Your code shows errors on lines 10, 11, 13, 14 and 17. The instruction _mm_madd_epi16 cannot take 3 arguments. And vk0 is undefined? Please resolve these errors. – geeta Jun 07 '12 at 12:00
  • Sorry - that's what happens when you take some working code and try to edit it down into a simple example - I think it's more or less fixed now. – Paul R Jun 07 '12 at 12:07
  • FYI, I tested this on an Intel Xeon W3550 3.07GHz processor and it showed a 37% speedup compared to the naive loop: sum = 0; for (i=0; i – cape1232 Feb 20 '13 at 13:58
  • However, when used in a function to compute the zero-mean sum of absolute differences (computer vision), where the only change was to use the optimized vs. the naive code above to compute the mean of a vector, the result was 10x faster. – cape1232 Feb 20 '13 at 14:19
  • @PaulR I would very much like to learn how to do this. I can't seem to find a tutorial or introduction document that explains it well. Do you have any suggestions? – user24205 Nov 11 '16 at 07:57
  • @user24205: there are a few tutorials out there if you Google, but I would also suggest searching for the `[sse]` tag here on StackOverflow, as there are a lot of good questions and answers covering the whole range from beginner level to expert. – Paul R Nov 11 '16 at 07:59
  • `psadbw` against zero is much faster for 8-bit integer elements, especially unsigned. Will post an answer if I get around to it. Or not since there's already https://stackoverflow.com/questions/10932550/sum-reduction-of-unsigned-bytes-without-overflow-using-sse2-on-intel – Peter Cordes Nov 07 '17 at 16:55
  • Yes, @harold's solution in the linked question is the way to go... – Paul R Nov 07 '17 at 16:58
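
For readers following the comments above, the `psadbw` idea can be sketched roughly as follows. This is not the code from the linked answer; the function name sum_bytes_sad is hypothetical, and the unaligned loads and scalar tail are assumptions added so the sketch accepts any buffer. _mm_sad_epu8 against zero sums each group of 8 unsigned bytes into a 64 bit lane, and _mm_add_epi64 (`paddq`) accumulates those lanes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum unsigned bytes using PSADBW against zero. */
uint64_t sum_bytes_sad(const uint8_t *a, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i vsum = _mm_setzero_si128();         /* two 64 bit partial sums */
    uint64_t sum;
    size_t i = 0;

    for (; i + 16 <= n; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        /* |v - 0| summed per 8 byte half: one instruction does the widening
           and most of the horizontal add. */
        vsum = _mm_add_epi64(vsum, _mm_sad_epu8(v, zero));
    }

    /* Combine the two 64 bit lanes and extract the low 64 bits. */
    vsum = _mm_add_epi64(vsum, _mm_srli_si128(vsum, 8));
    _mm_storel_epi64((__m128i *)&sum, vsum);

    /* Scalar tail for the remaining n % 16 bytes. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Because the accumulators are 64 bits wide, this form does not overflow for any array that fits in memory, and the loop body needs no unpack instructions.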