18

Is there a way to get the sum of the values stored in a __m256d variable? I have this code.

acc = _mm256_add_pd(acc, _mm256_mul_pd(row, vec));
//acc at this point contains {2.0, 8.0, 18.0, 32.0}
acc = _mm256_hadd_pd(acc, acc);
result[i] = ((double*)&acc)[0] + ((double*)&acc)[2];

This code works, but I want to replace it with SSE/AVX instructions.

Peter
  • My answer on the linked duplicate doesn't actually have an AVX `_pd` version, but it does show how to extract the high 128 for a `__m256` vector. And it explains the general idea for horizontal reductions: narrow by half every step. – Peter Cordes Apr 20 '18 at 12:54
  • BTW, consider producing 4 results for `result[i + 0..3]` in parallel, if convenient, to avoid needing a horizontal sum for every element. – Peter Cordes Apr 20 '18 at 12:56
  • Have re-opened as I'd just prepared a solution for doubles that I wanted to post as an answer. – Paul R Apr 20 '18 at 12:58
  • Usual technique: write scalar code, compile, read the generated asm (in this case, gcc generates vhaddpd+vperm2f128+vaddpd). – Marc Glisse Apr 20 '18 at 13:09
  • @MarcGlisse: What about with `-mtune=znver1`? Does that get it to use `vextractf128` instead of `vpermf128`? Seems like a weird choice in general, even if tuning for an Intel CPU, let alone `tune=generic`. – Peter Cordes Apr 20 '18 at 13:18
  • In my test the data is in memory at the beginning, instead of in a register. Tuning for znver1 then either doesn't vectorize, or vectorizes for 128 bits (which actually looks nicer than the 256-bit code). I didn't immediately find a version that would produce vextractf128. – Marc Glisse Apr 20 '18 at 14:58

3 Answers

25

It appears that you're doing a horizontal sum for every element of an output array. (Perhaps as part of a matmul?) This is usually sub-optimal; try to vectorize over the 2nd-from-inner loop so you can produce result[i + 0..3] in a vector and not need a horizontal sum at all.
For a dot-product of an array larger than one vector, sum vertically (into multiple accumulators), only hsumming once at the end.
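For illustration, here's a minimal sketch of that pattern (the function and loop structure are my own, not from the question; it assumes n is a multiple of 8 and uses the hsum_double_avx helper defined further down):

#include <immintrin.h>
#include <stddef.h>

// Hypothetical sketch: a dot product with two vertical accumulators,
// doing only one horizontal sum at the very end.
// Assumes n is a multiple of 8; hsum_double_avx is defined below.
double dot_product_avx(const double *a, const double *b, size_t n) {
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_mul_pd(_mm256_loadu_pd(a + i),
                                                 _mm256_loadu_pd(b + i)));
        acc1 = _mm256_add_pd(acc1, _mm256_mul_pd(_mm256_loadu_pd(a + i + 4),
                                                 _mm256_loadu_pd(b + i + 4)));
    }
    return hsum_double_avx(_mm256_add_pd(acc0, acc1));   // one hsum at the end
}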

For horizontal reductions in general, see Fastest way to do horizontal SSE vector sum (or other reduction) - extract the high half and add to the low half. Repeat until you're down to 1 element.
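A minimal sketch of that halving pattern for a __m256 of 8 floats (the function name is mine, not from the linked answer):

#include <immintrin.h>

// Sketch of narrowing by half each step: 256 -> 128 -> 64 -> 32 bits.
float hsum_float_avx(__m256 v) {
    __m128 vlow  = _mm256_castps256_ps128(v);
    __m128 vhigh = _mm256_extractf128_ps(v, 1);      // high 128
    __m128 sum4  = _mm_add_ps(vlow, vhigh);          // 8 -> 4 elements
    __m128 hi64  = _mm_movehl_ps(sum4, sum4);        // high 2 of the remaining 4
    __m128 sum2  = _mm_add_ps(sum4, hi64);           // 4 -> 2 elements
    __m128 hi32  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, hi32));    // 2 -> 1 element
}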

If you're using this inside an inner loop, you definitely don't want to be using hadd(same,same). That costs 2 shuffle uops instead of 1, unless your compiler saves you from yourself. (And gcc/clang don't.) hadd is good for code-size but pretty much nothing else when you only have 1 vector. It can be useful and efficient with two different inputs.
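(As an aside, here's a sketch of that two-different-inputs case; this is my own addition, not part of the recommendation below: hsumming four vectors at once, so each vhaddpd does useful work.)

// Sketch (my own addition): returns {hsum(v0), hsum(v1), hsum(v2), hsum(v3)}.
// Each vhaddpd combines two different inputs, so none of its shuffle work is wasted.
__m256d hsum4_double_avx(__m256d v0, __m256d v1, __m256d v2, __m256d v3) {
    __m256d h01  = _mm256_hadd_pd(v0, v1);  // {v0[0]+v0[1], v1[0]+v1[1], v0[2]+v0[3], v1[2]+v1[3]}
    __m256d h23  = _mm256_hadd_pd(v2, v3);  // same layout for v2, v3
    __m256d swap = _mm256_permute2f128_pd(h01, h23, 0x21);  // {h01 high half, h23 low half}
    __m256d blnd = _mm256_blend_pd(h01, h23, 0xC);          // {h01 low half,  h23 high half}
    return _mm256_add_pd(swap, blnd);
}

(On Zen 1 the vperm2f128 there is still expensive, so that pattern is mainly attractive on Intel and Zen 2 or later.)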


For AVX, this means the only 256-bit operation we need is an extract, which is fast on AMD and Intel. Then the rest is all 128-bit:

#include <immintrin.h>

inline
double hsum_double_avx(__m256d v) {
    __m128d vlow  = _mm256_castpd256_pd128(v);
    __m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
            vlow  = _mm_add_pd(vlow, vhigh);     // reduce down to 128

    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return  _mm_cvtsd_f64(_mm_add_sd(vlow, high64));  // reduce to scalar
}

If you wanted the result broadcast to every element of a __m256d, you'd use vshufpd and vperm2f128 to swap high/low halves (if tuning for Intel). And use 256-bit FP add the whole time. If you cared about early Ryzen at all, you might reduce to 128, use _mm_shuffle_pd to swap, then vinsertf128 to get a 256-bit vector. Or with AVX2, vbroadcastsd on the final result of this. But that would be slower on Intel than staying 256-bit the whole time while still avoiding vhaddpd.
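A sketch of that broadcast variant (the stay-256-bit version you'd pick when tuning for Intel; the function name is mine):

// Sketch of the broadcast variant: the total sum ends up in all 4 elements.
// Stays 256-bit the whole time (the Intel-tuned choice described above).
__m256d hsum_broadcast_double_avx(__m256d v) {
    __m256d swap_pairs = _mm256_shuffle_pd(v, v, 0x5);   // vshufpd: swap the two doubles within each 128-bit lane
    __m256d lane_sums  = _mm256_add_pd(v, swap_pairs);   // {v0+v1, v0+v1, v2+v3, v2+v3}
    __m256d swap_lanes = _mm256_permute2f128_pd(lane_sums, lane_sums, 0x21);  // vperm2f128: swap 128-bit halves
    return _mm256_add_pd(lane_sums, swap_lanes);         // total sum broadcast to every element
}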

Compiled with gcc7.3 -O3 -march=haswell on the Godbolt compiler explorer

    vmovapd         xmm1, xmm0               # silly compiler, vextract to xmm1 instead
    vextractf128    xmm0, ymm0, 0x1
    vaddpd          xmm0, xmm1, xmm0
    vunpckhpd       xmm1, xmm0, xmm0         # no wasted code bytes on an immediate for vpermilpd or vshufpd or anything
    vaddsd          xmm0, xmm0, xmm1         # scalar means we never raise FP exceptions for results we don't use
    vzeroupper
    ret

After inlining (which you definitely want), vzeroupper sinks to the bottom of the whole function, and hopefully the vmovapd optimizes away, with vextractf128 writing a different register instead of destroying xmm0, which holds the _mm256_castpd256_pd128 result.


On first-gen Ryzen (Zen 1 / 1+), according to Agner Fog's instruction tables, vextractf128 is 1 uop with 1c latency, and 0.33c throughput.

@PaulR's version is unfortunately terrible on AMD before Zen 2; it's like something you might find in an Intel library or compiler output as a "cripple AMD" function. (I don't think Paul did that on purpose, I'm just pointing out how ignoring AMD CPUs can lead to code that runs slower on them.)

On Zen 1, vperm2f128 is 8 uops, 3c latency, and one per 3c throughput. vhaddpd ymm is 8 uops (vs. the 6 you might expect), 7c latency, one per 3c throughput. Agner says it's a "mixed domain" instruction. And 256-bit ops always take at least 2 uops.

     # Paul's version                      # Ryzen      # Skylake
    vhaddpd       ymm0, ymm0, ymm0         # 8 uops     # 3 uops
    vperm2f128    ymm1, ymm0, ymm0, 49     # 8 uops     # 1 uop
    vaddpd        ymm0, ymm0, ymm1         # 2 uops     # 1 uop
                           # total uops:   # 18         # 5

vs.

     # my version with vmovapd optimized out: extract to a different reg
    vextractf128    xmm1, ymm0, 0x1        # 1 uop      # 1 uop
    vaddpd          xmm0, xmm1, xmm0       # 1 uop      # 1 uop
    vunpckhpd       xmm1, xmm0, xmm0       # 1 uop      # 1 uop
    vaddsd          xmm0, xmm0, xmm1       # 1 uop      # 1 uop
                           # total uops:   # 4          # 4

Total uop throughput is often the bottleneck in code with a mix of loads, stores, and ALU, so I expect the 4-uop version is likely to be at least a little better on Intel, as well as much better on AMD. It should also make slightly less heat, and thus allow slightly higher turbo / use less battery power. (But hopefully this hsum is a small enough part of your total loop that this is negligible!)

The latency is not worse, either, so there's really no reason to use an inefficient vhaddpd / vperm2f128 version.


Zen 2 and later have 256-bit wide vector registers and execution units (including shuffle). They don't have to split lane-crossing shuffles into many uops, but conversely vextractf128 is no longer about as cheap as vmovdqa xmm. Zen 2 is a lot closer to Intel's cost model for 256-bit vectors.

Peter Cordes
7

You can do it like this:

acc = _mm256_hadd_pd(acc, acc);    // horizontal add top lane and bottom lane
acc = _mm256_add_pd(acc, _mm256_permute2f128_pd(acc, acc, 0x31));  // add lanes
result[i] = _mm256_cvtsd_f64(acc); // extract double

Note: if this is in a "hot" (i.e. performance-critical) part of your code (especially if running on an AMD CPU) then you might instead want to look at Peter Cordes's answer regarding more efficient implementations.

Paul R
  • On Ryzen this is *terrible*, BTW. permutef128 is *much* slower than extract, and there's no reason to do any 256-bit vector operations for this because you want a single scalar result and don't need the result broadcast to every element. On Intel the only sub-optimal part is `hadd`. – Peter Cordes Apr 20 '18 at 13:13
  • This is exactly what I looked for! Thanks! – Peter Apr 20 '18 at 13:14
  • Ah - don't know anything about AMD CPUs - what do you suggest as an alternative ? – Paul R Apr 20 '18 at 13:14
  • What I said in comments and in [Fastest way to do horizontal float vector sum on x86](//stackoverflow.com/q/6996764): extract down to 128, then do a 128-bit hsum the SSE2 way: with a shuffle + add. Not worse for Intel, and much better for AMD. `vextractf128` is nearly free on AMD, according to Agner Fog's numbers: it's not even a shuffle because vectors are already split in half internally. – Peter Cordes Apr 20 '18 at 13:15
  • Your version on Ryzen: 18 uops. My version: 4. It's a surprisingly big difference if you haven't thought about CPUs that split 256b ops to 128. Mine saves a uop on Intel as well from avoiding `hadd`. (AMD has worse `hadd` than I'd remembered; apparently it's a "mixed domain" instruction: integer / fp. Like they don't even care much about making it fast or something, or at least the 256-bit version which is probably more rarely useful because of its in-lane behaviour. This case doesn't count as useful because you can do the same thing faster with a separate single shuffle.) – Peter Cordes Apr 20 '18 at 14:09
  • Thanks Peter - useful to know - we don't currently do anything with AMD CPUs but I have to keep an eye on them as the technology evolves - they do seem to be catching up with Intel again gradually. – Paul R Apr 20 '18 at 14:21
  • Yeah, for your own use you don't need to care about AMD if you don't use them. But for SO answers, it's definitely a good idea to write code that's efficient on AMD as well, especially when it's equal on Intel. I think it's healthier for the industry if Intel continues to have some competition, and having random pieces of software that copy code from SO answers not gimp themselves on AMD CPUs helps with that. And the people that copy your code will probably be happier if their code runs well on more CPUs :P Fortunately, reducing down to 128 first is good and easy to remember. – Peter Cordes Apr 21 '18 at 02:47
3

In gcc and clang, SIMD types are built-in vector types. E.g.:

// avxintrin.h
typedef double __m256d __attribute__((__vector_size__(32), __aligned__(32)));

These built-in vectors support indexing, so you can write it conveniently and leave it up to the compiler to make good code:

double hsum_double_avx2(__m256d v) {
    return v[0] + v[1] + v[2] + v[3];
}

clang-14 -O3 -march=znver3 -ffast-math generates the same assembly as it does for Peter Cordes's intrinsics:

# clang -O3 -ffast-math
hsum_double_avx2:
        vextractf128    xmm1, ymm0, 1
        vaddpd  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        vzeroupper
        ret

Unfortunately gcc does much worse, generating sub-optimal instructions: it doesn't take advantage of the freedom to re-associate the 3 + operations, and it uses vhaddpd xmm to do the v[0] + v[1] part, which costs 4 uops on Zen 3. (Or 3 uops on Intel CPUs, 2 shuffles + an add.)

-ffast-math is of course necessary for the compiler to be able to do a good job, unless you write it as (v[0]+v[2]) + (v[1]+v[3]). With that, clang still makes the same asm with -O3 -march=icelake-server without -ffast-math.
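Spelled out, that hand-associated version looks like this (the function name is mine):

// The explicitly pair-associated form mentioned above. Its temporaries match
// the manually vectorized reduction, so clang compiles it to the same asm
// even without -ffast-math.
double hsum_double_avx2_pairs(__m256d v) {
    return (v[0] + v[2]) + (v[1] + v[3]);
}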


Ideally, I want to write plain code as I did above and let the compiler use a CPU-specific cost model to emit optimal instructions in the right order for that specific CPU.

One reason is that a labour-intensive, hand-coded version that is optimal for Haswell may well be suboptimal for Zen 3. For this problem specifically, that's not really the case: starting by narrowing to 128-bit with vextractf128 + vaddpd is optimal everywhere. There are minor variations in shuffle throughput on different CPUs; for example, Ice Lake and later Intel can run vshufps on port 1 or 5, but some shuffles like vpermilps/pd or vunpckhpd still run only on port 5. Zen 3 (like Zen 2 and 4) has good throughput for either of those shuffles, so clang's asm happens to be good there. But it's unfortunate that clang -march=icelake-server still uses vpermilpd.

A frequent use-case nowadays is computing in the cloud with diverse CPU models and generations, compiling the code on that host with -march=native -mtune=native for best performance.

In theory, if compilers were smarter, that approach would optimize short sequences like this into ideal asm, as well as making generally good choices for heuristics like inlining and unrolling. It's usually the best choice for a binary that will run on only one machine, but as GCC demonstrates here, the results are often far from optimal. Fortunately, modern AMD and Intel aren't too different most of the time, having different throughputs for some instructions but usually being single-uop for the same instructions.

Peter Cordes
Maxim Egorushkin
  • Almost identical??? Mine has 2 shuffle uops (`vextractf128` and `vunpckhpd`) and 2 `add` uops (one pd, one sd). GCC's mess has 4 shuffles (2 from `vhaddpd xmm`, plus `vextractf128` and `vunpckhpd`), and 3 adds (1 from `vhaddpd`, and two scalar adds). Much of that badness is required without `-ffast-math` to let it re-associate FP math. – Peter Cordes Dec 19 '22 at 20:35
  • You might have better luck with `(v[0]+v[1]) + (v[2]+v[3])` which might at least let it use `vhaddpd ymm`, although that still sucks. `(v[0]+v[2]) + (v[1]+v[3])` has the same temporaries as the manually vectorized version so one might hope it would compile that way even without `-ffast-math`. Clang does get it right: https://godbolt.org/z/Wvjbzc1rj but GCC still makes a mess. With `-ffast-math`, clang gets it right for any version (same asm as my manually-vectorized version), but GCC is still just as bad: https://godbolt.org/z/ejMTGbEKM – Peter Cordes Dec 19 '22 at 20:35
  • @PeterCordes Didn't mean to irk you as much. How much is your version faster than gcc and clang generated versions in a micro-benchmark on modern hardware? – Maxim Egorushkin Dec 20 '22 at 14:33
  • @PeterCordes Ideally, I want to write plain code as I did above and let the compiler use the CPU-specific cost model to emit optimal instructions in right order for that specific CPU. Specifically, I want to give the compiler maximum flexibility by not imposing evaluation order with parenthesis. Not sure if that will ever happen with gcc, but with clang I am fairly optimistic. – Maxim Egorushkin Dec 20 '22 at 15:02
  • Mine would be twice as fast for throughput on Haswell/Skylake, like I said 2 shuffles instead of 4. Other than GCC's braindead wasted `vmovapd xmm1, xmm0` at the start, mine is also ~optimal on Zen1 through Zen4, and Ice Lake / Alder Lake. All 4 of the shuffle uops need port 5 on Intel (except Alder Lake E-cores), unfortunately not like `vshufps` on Ice Lake. There are no modern CPUs where `vhaddpd` is efficient; it's 3 uops on Intel, 4 uops on AMD. If you don't care about performance for your cleanup code, just say so. Please don't make wrong statements like the asm being at all similar. – Peter Cordes Dec 20 '22 at 18:36
  • For latency, gcc also has 3 `addsd` latencies on the critical path, vs. 2 for mine with the tree pattern. And the inevitable resource conflict for 2 shuffles from `vhaddpd` makes it higher latency, too. At least `vextractf128` can run in parallel with that, but it's still worse. Some AMD CPUs don't suffer as badly from this version, with better shuffle throughput than Intel, but `vhaddpd` costing an extra uop on AMD does hurt. I don't know of any CPUs where narrowing to 128-bit as the first step isn't optimal. For Ice Lake, a shuffle that could run on p15 would be a small win vs `vunpckhpd` – Peter Cordes Dec 20 '22 at 18:38
  • I think it would be a big improvement to your answer to point out that you can get actually equal code by compiling your source with `clang -O3 -ffast-math`, which allows it to re-associate into the order that it can do with efficient shuffles. (And that GCC doesn't usefully invent shuffles to vectorize the reduction.) Your point about optimal-for-whatever-microarchitecture only makes any sense if your compiler has the freedom to optimize (fast-math), and uses that freedom in a useful way (clang, not GCC). – Peter Cordes Dec 20 '22 at 18:49
  • BTW, I'm not personally irked by the comparison to my code, just by this misleading answer suggesting to future readers that this is anywhere near as efficient. Often the efficiency of cleanup code doesn't matter after a long-running loop, but if people went looking for an answer, IMO they should find an efficient one. Fewer uops for the front-end so it can start issuing independent work a cycle sooner, and hopefully packs better into the uop cache. – Peter Cordes Dec 20 '22 at 18:53
  • @PeterCordes Fair enough, I stand corrected. – Maxim Egorushkin Dec 20 '22 at 22:53
  • @PeterCordes You are too kind with your generous amendments, thank you. – Maxim Egorushkin Dec 21 '22 at 20:04