Fastest way to do horizontal vector sum with AVX instructions

Question

I have a packed vector of four 64-bit floating-point values.
I would like to get the sum of the vector's elements.

With SSE (and using 32-bit floats) I could just do the following:

v_sum = _mm_hadd_ps(v_sum, v_sum);
v_sum = _mm_hadd_ps(v_sum, v_sum);

Unfortunately, even though AVX features a _mm256_hadd_pd instruction, it differs in the result from the SSE version. I believe this is due to the fact that most AVX instructions work as SSE instructions for each low and high 128-bits separately, without ever crossing the 128-bit boundary.

Ideally, the solution I am looking for should follow these guidelines:
1) only use AVX/AVX2 instructions. (no SSE)
2) do it in no more than 2-3 instructions.

However, any efficient/elegant way to do it (even without following the above guidelines) is always well accepted.

Thanks a lot for any help.

-Luigi Castelli

Start with `_mm256_extractf128_ps`, `_mm_add_ps` the two halves together, then use [the existing methods for reducing a 128b vector](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86). — Peter Cordes, Apr 12 '16 at 21:11

Jason R · Accepted Answer · 2022-05-11T15:28:59.690

29

If you have two __m256d vectors x1 and x2 that each contain four doubles that you want to horizontally sum, you could do:

__m256d x1, x2;
// calculate 4 two-element horizontal sums:
// lower 64 bits contain x1[0] + x1[1]
// next 64 bits contain x2[0] + x2[1]
// next 64 bits contain x1[2] + x1[3]
// next 64 bits contain x2[2] + x2[3]
__m256d sum = _mm256_hadd_pd(x1, x2);
// extract upper 128 bits of result
__m128d sum_high = _mm256_extractf128_pd(sum, 1);
// add upper 128 bits of sum to its lower 128 bits
__m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
// lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
// upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]

So it looks like 3 instructions will do 2 of the horizontal sums that you need. The above is untested, but you should get the concept.

edited May 11 '22 at 15:28

answered Mar 19 '12 at 19:24

Jason R

11,159
6
50
81

Instead of using casting with (__m128d) I think you should use _mm256_castpd256_pd128(sum). – Z boson Dec 05 '13 at 17:22
What compiler did you use for this? I asked a question about your cast here [c-style-cast-versus-intrinsic-cast](http://stackoverflow.com/questions/20401413/c-style-cast-versus-intrinsic-cast). I don't have access to system to test this on right now. – Z boson Dec 05 '13 at 18:27
@Zboson: I edited the snippet to use the more standard casting method. – Jason R Feb 12 '14 at 13:25
This is less efficient than possible on Intel CPUs, and very inefficient on AMD CPUs. See [Get sum of values stored in \_\_m256d with SSE/AVX](//stackoverflow.com/q/49941645). It's not wrong, but the question asked for "fastest", not simple or compact / small code-size. – Peter Cordes Jan 29 '19 at 19:50
I think I misread this earlier. You're doing 2 separate horizontal sums. That's actually good on Intel, and maybe ok on Zen or Zen2. But not if you're ultimately going to add the two halves of `__m128d result` to each other. So to make good use of this, you need to have two independent things you need to hsum. – Peter Cordes Mar 31 '20 at 06:19
simple typo (can't edit), I believe in the line __m128d sum_high = _mm256_extractf128_pd(sum1, 1); sum1 should be sum – J. Doe May 11 '22 at 13:53

score 4 · Answer 2 · answered Dec 05 '13 at 01:39

4

If you want just the sum, and a bit of scalar code is acceptable:

__m256d x;
__m256d s = _mm256_hadd_pd(x,x);
return ((double*)&s)[0] + ((double*)&s)[2];

answered Dec 05 '13 at 01:39

RJVB

698
8
18

6

Smells like undefined behavior with that cast. – wingerse Oct 21 '18 at 23:05
1

@wingerse: Indeed, support for Intel intrinsics does *not* imply well-defined behaviour for pointing a plain type into a `__mxxx` type, only the other way around. [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) This is strict-aliasing UB, and may break in practice on GCC or clang. – Peter Cordes May 09 '22 at 21:41

score 4 · Answer 3 · edited May 09 '22 at 21:03

Assuming the following, that you have a __m256d vector containing 4 packed doubles and you would like to calculate the sum of its components, ie a0, a1, a2, a3 is each double component you would like a0 + a1 + a2 + a3 then heres another AVX solution:

// goal to calculate a0 + a1 + a2 + a3
__m256d values = _mm256_set_pd(23211.24, -123.421, 1224.123, 413.231);

// assuming _mm256_hadd_pd(a, b) == a0 + a1, b0 + b1, a2 + a3, b2 + b3 (5 cycles) ...
values = _mm256_hadd_pd(values, _mm256_permute2f128_pd(values, values, 1));
// ^^^^^^^^^^^^^^^^^^^^ a0 + a1, a2 + a3, a2 + a3, a0 + a1

values = _mm256_hadd_pd(values, values);
// ^^^^^^^^^^^^^^^^^^^^ (a0 + a1 + a2 + a3), (a0 + a1 + a2 + a3), (a2 + a3 + a0 + a1), (a2 + a3 + a0 + a1)

// Being that addition is associative then each component of values contains the sum of all its initial components (11 cycles) to calculate, (1-2 cycles) to extract, total (12-13 cycles)
double got = _mm_cvtsd_f64(_mm256_castpd256_pd128(values)), exp = (23211.24 + -123.421 + 1224.123 + 413.231);

if (got != exp || _mm256_movemask_pd(_mm256_cmp_pd(values, _mm256_set1_pd(exp), _CMP_EQ_OS)) != 0b1111)
    printf("Failed to sum double components, exp: %f, got %f\n", exp, got);
else
    printf("ok\n");

This solution has the sum broadcasted which maybe useful ...

If I misinterpreted the problem I do apologize.

$ uname -a
Darwin Samys-MacBook-Pro.local 13.3.0 Darwin Kernel Version 13.3.0: Tue Jun  3 21:27:35 PDT 2014; root:xnu-2422.110.17~1/RELEASE_X86_64 x86_64

$ gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix

If you want the result sum to all elements, hadd is still not the most efficient way. You only need 1 shuffle + add, not HADD's 2x shuffle uops that feed one add. e.g. `val = add(val, permute2f128(val,val 1));` then swap within 128-bit lanes with `_mm256_shuffle_pd` to feed another add. On Intel that's only 4 uops total, same as [Get sum of values stored in \_\_m256d with SSE/AVX](//stackoverflow.com/q/49941645) — Peter Cordes, Jan 29 '19 at 19:53

Fastest way to do horizontal vector sum with AVX instructions

3 Answers3

Linked