SIMD the following code

Question

How do I SIMIDize the following code in C (using SIMD intrinsics of course)? I am having trouble understanding SIMD intrinsics and this would help a lot:

int sum_naive( int n, int *a )
{
    int sum = 0;
    for( int i = 0; i < n; i++ )
        sum += a[i];
    return sum;
}

SSE The following intrinsics could be used.__m128i _mm_setzero_si128( ) __m128i _mm_loadu_si128( __m128i *p ) __m128i _mm_add_epi32( __m128i a, __m128i b )(a0+b0, a1+b1, a2+b2, a3+b3) void _mm_storeu_si128( __m128i *p, __m128i a ) — user1585869, Aug 08 '12 at 21:03
int sum_vectorized( int n, int *a ) { __m128i sum = _mm_setzero_si128(); __m128i a = _mm_loadu_si128( __m128i *p) for (int i = 0; i — user1585869, Aug 08 '12 at 21:22
Welcome to StackOverflow. You should extend the question with what you tried instead of commenting. — unkulunkulu, Aug 09 '12 at 08:22

Paul R · Answer 1 · 2012-08-09T07:57:42.850

17

Here's a fairly straightforward implementation (warning: untested code):

int32_t sum_array(const int32_t a[], const int n)
{
    __m128i vsum = _mm_set1_epi32(0);       // initialise vector of four partial 32 bit sums
    int32_t sum;
    int i;

    for (i = 0; i < n; i += 4)
    {
        __m128i v = _mm_load_si128(&a[i]);  // load vector of 4 x 32 bit values
        vsum = _mm_add_epi32(vsum, v);      // accumulate to 32 bit partial sum vector
    }
    // horizontal add of four 32 bit partial sums and return result
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
    return sum;
}

Note that the input array, a[], needs to be 16 byte aligned, and n should be a multiple of 4.

edited Aug 09 '12 at 07:57

answered Aug 09 '12 at 06:23

Paul R

208,748
37
389
560

Couple minor things. _mm_setzero_si128() is better, instruction is faster than RAM reference and not all compilers optimize away set with 0 argument. Also, for horizontal operations, I often see in profiler that it’s faster to store then use scalar code. – Soonts May 25 '19 at 00:42
@Soonts: for any reasonable value of n the code before/after the loop is pretty much irrelevant. – Paul R May 25 '19 at 06:54
Isn't a+i faster than &a[i]? – wallabra Nov 24 '19 at 16:36
1

@Gustavo6046: any half-decent compiler will generate identical code for these two equivalent ways of specifying the same thing. Try it out on [godbolt](https://godbolt.org) if you have any doubts. – Paul R Nov 24 '19 at 21:51

SIMD the following code

1 Answers1

Linked