20

I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use

```c
_mm_hadd_ps(xmm, xmm);
_mm_hadd_ps(xmm, xmm);
```

to get the result in the first component of the vector, but this does not carry over to the 256-bit version of the function (`_mm256_hadd_ps`), which only adds pairs within each 128-bit lane.

What is the best way to compute the horizontal sum of a __m256 vector?
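For reference, the SSE approach above as a complete function. This is a sketch: the function names and the GCC/Clang `target` attribute (which stands in for compiling with `-msse3`) are mine.

```c
#include <immintrin.h>

// Sketch of the SSE3 horizontal sum described above.
__attribute__((target("sse3")))   // or compile with -msse3
float hsum4(__m128 xmm) {
    xmm = _mm_hadd_ps(xmm, xmm);  // ( x3+x2, x1+x0, x3+x2, x1+x0 )
    xmm = _mm_hadd_ps(xmm, xmm);  // total sum in every element
    return _mm_cvtss_f32(xmm);    // extract the first component
}

// Convenience wrapper taking a plain float array.
__attribute__((target("sse3")))
float hsum4_array(const float a[4]) {
    return hsum4(_mm_loadu_ps(a));
}
```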

Yoav
  • Use SSE to compute the horizontal sum of the lower part; shuffle the YMM high/low parts, use SSE again and sum up the two scalars. Or wait for AVX2. – Aki Suihkonen Nov 04 '12 at 14:25
  • Is this inside a loop or is it just a one-off operation? – Paul R Nov 04 '12 at 15:19
  • It's inside an outer loop where there is another inner loop. – Yoav Nov 04 '12 at 15:57
  • See also [this 128b SSE answer](http://stackoverflow.com/a/35270026/224132) for more optimal (lower latency, fewer uops) alternatives to `haddps` after you've done the `vextractf128` / `addps` step. – Peter Cordes Feb 17 '16 at 07:58

2 Answers

16

This version should be optimal on Intel Sandy/Ivy Bridge and AMD Bulldozer, as well as on later CPUs.

```c
// x = ( x7, x6, x5, x4, x3, x2, x1, x0 )
float sum8(__m256 x) {
    // hiQuad = ( x7, x6, x5, x4 )
    const __m128 hiQuad = _mm256_extractf128_ps(x, 1);
    // loQuad = ( x3, x2, x1, x0 )
    const __m128 loQuad = _mm256_castps256_ps128(x);
    // sumQuad = ( x3 + x7, x2 + x6, x1 + x5, x0 + x4 )
    const __m128 sumQuad = _mm_add_ps(loQuad, hiQuad);
    // loDual = ( -, -, x1 + x5, x0 + x4 )
    const __m128 loDual = sumQuad;
    // hiDual = ( -, -, x3 + x7, x2 + x6 )
    const __m128 hiDual = _mm_movehl_ps(sumQuad, sumQuad);
    // sumDual = ( -, -, x1 + x3 + x5 + x7, x0 + x2 + x4 + x6 )
    const __m128 sumDual = _mm_add_ps(loDual, hiDual);
    // lo = ( -, -, -, x0 + x2 + x4 + x6 )
    const __m128 lo = sumDual;
    // hi = ( -, -, -, x1 + x3 + x5 + x7 )
    const __m128 hi = _mm_shuffle_ps(sumDual, sumDual, 0x1);
    // sum = ( -, -, -, x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 )
    const __m128 sum = _mm_add_ss(lo, hi);
    return _mm_cvtss_f32(sum);
}
```

`haddps` is not efficient on any CPU; the best you can do is one shuffle (to extract the high half) and one add, repeating until one element is left. Narrowing to 128 bits as the first step benefits AMD CPUs before Zen 2, and is not a bad thing anywhere.

See Fastest way to do horizontal SSE vector sum on x86 for more details about efficiency.
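For completeness, the lower-latency 128-bit pattern that the comment's linked answer describes (using `movshdup` instead of a shuffle, which saves a register copy), combined with the narrowing step above. The function names and the `target` attribute (standing in for compiling with `-mavx` on GCC/Clang) are mine.

```c
#include <immintrin.h>

__attribute__((target("avx")))    // or compile with -mavx
float hsum256_ps(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);    // low 128 bits (no instruction)
    __m128 vhigh = _mm256_extractf128_ps(v, 1);  // high 128 bits
    vlow = _mm_add_ps(vlow, vhigh);              // narrow to one 128-bit vector
    __m128 shuf = _mm_movehdup_ps(vlow);         // duplicate odd elements
    __m128 sums = _mm_add_ps(vlow, shuf);        // pairwise sums
    shuf = _mm_movehl_ps(shuf, sums);            // high pair -> low pair
    sums = _mm_add_ss(sums, shuf);               // total in element 0
    return _mm_cvtss_f32(sums);
}

// Convenience wrapper taking a plain float array.
__attribute__((target("avx")))
float hsum256_array(const float a[8]) {
    return hsum256_ps(_mm256_loadu_ps(a));
}
```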

Peter Cordes
Marat Dukhan
  • There are some weird corner cases (when performance is decode-bound) where using `haddps` instead would confer a benefit, but generally this is very reasonable. – Stephen Canon Nov 05 '12 at 01:46
  • On Bulldozer `haddps` is microcoded. Moreover, it will generate 3 macro-operations, while the code above uses only 2 for the partial reduction. – Marat Dukhan Nov 05 '12 at 13:56
  • Which is why I said "weird corner cases" (they are very rare, and truly weird). – Stephen Canon Nov 05 '12 at 14:09
  • Doesn't the use of SSE instructions (like `_mm_movehl_ps`) with 256-bit AVX instructions incur a state change penalty? – timbo Nov 14 '15 at 23:34
  • SSE instructions do cause a state change penalty, but if you compile for AVX instruction sets, `_mm_movehl_ps` and the like will generate the AVX forms of the instructions (`VMOVHLPS` in this particular case). – Marat Dukhan Nov 15 '15 at 03:09
7

This can be done with the following code:

```c
ymm2 = _mm256_permute2f128_ps(ymm, ymm, 1); // swap the two 128-bit halves
ymm  = _mm256_add_ps(ymm, ymm2);            // add high half to low half
ymm  = _mm256_hadd_ps(ymm, ymm);            // in-lane pairwise sums
ymm  = _mm256_hadd_ps(ymm, ymm);            // full sum in every element
```

but there might be a better solution.
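Wrapped into a self-contained function that also extracts the scalar result, the code above becomes the following sketch (the function names and the `target` attribute, which stands in for compiling with `-mavx` on GCC/Clang, are mine):

```c
#include <immintrin.h>

__attribute__((target("avx")))    // or compile with -mavx
float sum8_hadd(__m256 ymm) {
    __m256 ymm2 = _mm256_permute2f128_ps(ymm, ymm, 1); // swap 128-bit halves
    ymm = _mm256_add_ps(ymm, ymm2);  // per-lane partial sums
    ymm = _mm256_hadd_ps(ymm, ymm);  // in-lane pairwise sums
    ymm = _mm256_hadd_ps(ymm, ymm);  // full sum broadcast to every element
    return _mm_cvtss_f32(_mm256_castps256_ps128(ymm));
}

// Convenience wrapper taking a plain float array.
__attribute__((target("avx")))
float sum8_hadd_array(const float a[8]) {
    return sum8_hadd(_mm256_loadu_ps(a));
}
```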

Yoav
  • 5,962
  • 5
  • 39
  • 61