
How can I sum the elements (reduction) of a float vector using SSE intrinsics?

Simple serial code:

void sum(float *input, float &result, unsigned int NumElems)
{
    result = 0;
    for (unsigned int i = 0; i < NumElems; ++i)
        result += input[i];
}
gorill

1 Answer


Typically you generate 4 partial sums in your loop and then sum horizontally across the 4 elements after the loop, e.g.

#include <cassert>
#include <cstdint>
#include <pmmintrin.h> // SSE3, needed for _mm_hadd_ps

float vsum(const float *a, int n)
{
    float sum;
    __m128 vsum = _mm_set1_ps(0.0f);
    assert((n & 3) == 0);
    assert(((uintptr_t)a & 15) == 0);
    for (int i = 0; i < n; i += 4)
    {
        __m128 v = _mm_load_ps(&a[i]);
        vsum = _mm_add_ps(vsum, v);
    }
    vsum = _mm_hadd_ps(vsum, vsum); // horizontal add: [a b c d] -> [a+b c+d a+b c+d]
    vsum = _mm_hadd_ps(vsum, vsum); // second pass leaves the total in every lane
    _mm_store_ss(&sum, vsum);       // extract lane 0
    return sum;
}

Note: for the above example, `a` must be 16-byte aligned and `n` must be a multiple of 4. If the alignment of `a` cannot be guaranteed, use `_mm_loadu_ps` instead of `_mm_load_ps`. If `n` is not guaranteed to be a multiple of 4, add a scalar loop at the end of the function to accumulate any remaining elements.
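If you need both relaxations at once, a sketch (with a hypothetical name `vsum_any`) might look like the following; it also replaces the SSE3 `_mm_hadd_ps` reduction with SSE2 shuffles, so it compiles without `-msse3`:

```cpp
#include <emmintrin.h> // SSE2 only

// Sum a float array of any length and any alignment (illustrative
// helper, not part of the answer's original code).
float vsum_any(const float *a, int n)
{
    __m128 vsum = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(&a[i])); // no alignment required

    // SSE2 horizontal reduction: [a b c d] -> a+b+c+d in lane 0
    __m128 hi = _mm_movehl_ps(vsum, vsum);      // [c d c d]
    vsum = _mm_add_ps(vsum, hi);                // [a+c b+d . .]
    hi = _mm_shuffle_ps(vsum, vsum, 0x55);      // broadcast lane 1 (b+d)
    vsum = _mm_add_ss(vsum, hi);                // lane 0 = a+b+c+d

    float sum = _mm_cvtss_f32(vsum);
    for (; i < n; ++i)                          // scalar tail for n % 4 elements
        sum += a[i];
    return sum;
}
```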

Paul R
  • If the input array is potentially large, it's worth having a scalar loop at the start, too, that runs 0-3 times until the input is aligned on a 16B boundary for the SSE loop. Then you won't have loads that cross cache/page lines slowing down your loop. And it can use `ADDPS` with a memory operand, which can potentially micro-fuse, reducing overhead. Also, you could get 2 or 4 dependency chains going, by using multiple accumulators, so your loop could sustain 1 vector FP add per cycle, instead of 1 per (latency of `ADDPS` = 3). – Peter Cordes Jul 05 '15 at 14:57
  • @PeterCordes Can you explain about `_mm_add_ps` with a memory operand? Do you mean a pointer argument rather than an `__m128`? – user2023370 Mar 09 '19 at 12:40
  • @user2023370: `addps` the asm instruction, not `_mm_add_ps` the intrinsic. I'm talking about the compiler's code-gen options. e.g. `addps xmm0, [rdi]` (folding a `_mm_load_ps` into a memory operand that can micro-fuse in the decoders) instead of `movups xmm1, [rdi]` / `addps xmm0, xmm1`. (`_mm_loadu_ps` can't fold unless you have AVX to allow unaligned memory operands.) Of course you still definitely want multiple accumulators to hide FP add latency: see [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?](//stackoverflow.com/q/45113527) for more. – Peter Cordes Mar 09 '19 at 13:04
  • Thanks. That's very helpful. I read your answer to the other question too. Can I check my understanding: if I add one accumulator to the answer above, we might have a `v1` and `v2`, along with a `vsum1` and `vsum2`, and the loop increment by 8 rather than 4? – user2023370 Mar 09 '19 at 21:30
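A minimal sketch of the two-accumulator variant discussed in these comments, assuming (as in the answer) that `a` is 16-byte aligned and, additionally, that `n` is a multiple of 8; the names `vsum0`/`vsum1` are illustrative:

```cpp
#include <emmintrin.h> // SSE2

// Two independent accumulators give the CPU two dependency chains,
// so consecutive ADDPS instructions can overlap instead of each one
// waiting the full add latency for the previous result.
float vsum2(const float *a, int n)
{
    __m128 vsum0 = _mm_setzero_ps();
    __m128 vsum1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8)
    {
        vsum0 = _mm_add_ps(vsum0, _mm_load_ps(&a[i]));     // chain 0
        vsum1 = _mm_add_ps(vsum1, _mm_load_ps(&a[i + 4])); // chain 1
    }
    __m128 vsum = _mm_add_ps(vsum0, vsum1); // merge the two chains

    // SSE2 horizontal reduction, leaving the total in lane 0
    __m128 hi = _mm_movehl_ps(vsum, vsum);
    vsum = _mm_add_ps(vsum, hi);
    hi = _mm_shuffle_ps(vsum, vsum, 0x55);
    vsum = _mm_add_ss(vsum, hi);
    return _mm_cvtss_f32(vsum);
}
```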