14

How do I SIMIDize the following code in C (using SIMD intrinsics of course)? I am having trouble understanding SIMD intrinsics and this would help a lot:

int sum_naive( int n, int *a )
{
    int sum = 0;
    for( int i = 0; i < n; i++ )
        sum += a[i];
    return sum;
}
Paul R
  • 208,748
  • 37
  • 389
  • 560
user1585869
  • 301
  • 3
  • 11
  • 1
    Which SIMD did you have in mind? SSE2? – harold Aug 08 '12 at 20:52
  • SSE The following intrinsics could be used.__m128i _mm_setzero_si128( ) __m128i _mm_loadu_si128( __m128i *p ) __m128i _mm_add_epi32( __m128i a, __m128i b )(a0+b0, a1+b1, a2+b2, a3+b3) void _mm_storeu_si128( __m128i *p, __m128i a ) – user1585869 Aug 08 '12 at 21:03
  • 1
    Ok, so SSE2. What have you tried? – harold Aug 08 '12 at 21:08
  • int sum_vectorized( int n, int *a ) { __m128i sum = _mm_setzero_si128(); __m128i a = _mm_loadu_si128( __m128i *p) for (int i = 0; i – user1585869 Aug 08 '12 at 21:22
  • 5
    Welcome to StackOverflow. You should extend the question with what you tried instead of commenting. – unkulunkulu Aug 09 '12 at 08:22

1 Answers1

17

Here's a fairly straightforward implementation (warning: untested code):

int32_t sum_array(const int32_t a[], const int n)
{
    __m128i vsum = _mm_set1_epi32(0);       // initialise vector of four partial 32 bit sums
    int32_t sum;
    int i;

    for (i = 0; i < n; i += 4)
    {
        __m128i v = _mm_load_si128(&a[i]);  // load vector of 4 x 32 bit values
        vsum = _mm_add_epi32(vsum, v);      // accumulate to 32 bit partial sum vector
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
    return sum;
}

Note that the input array, a[], needs to be 16 byte aligned, and n should be a multiple of 4.

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • Couple minor things. _mm_setzero_si128() is better, instruction is faster than RAM reference and not all compilers optimize away set with 0 argument. Also, for horizontal operations, I often see in profiler that it’s faster to store then use scalar code. – Soonts May 25 '19 at 00:42
  • @Soonts: for any reasonable value of n the code before/after the loop is pretty much irrelevant. – Paul R May 25 '19 at 06:54
  • Isn't a+i faster than &a[i]? – wallabra Nov 24 '19 at 16:36
  • 1
    @Gustavo6046: any half-decent compiler will generate identical code for these two equivalent ways of specifying the same thing. Try it out on [godbolt](https://godbolt.org) if you have any doubts. – Paul R Nov 24 '19 at 21:51