
I'm trying to optimize some code and I'm at a point where I have 4 __m256d vectors and I want to store the horizontal sum of each of them in another __m256d, so basically result = [sum(a), sum(b), sum(c), sum(d)]. I know there is a way to do this using 2 hadds, a blend and a permute, but I realised that hadd is too expensive.

So I was wondering if there is an intrinsic that allows doing this faster.
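For reference, the 2-hadd + blend + permute sequence I mean looks roughly like this (a sketch with made-up names, not my exact code):

#include <immintrin.h>

// result = [sum(a), sum(b), sum(c), sum(d)] via 2x hadd + permute + blend + add
__m256d sum4(__m256d a, __m256d b, __m256d c, __m256d d)
{
    __m256d h0 = _mm256_hadd_pd(a, b);                      // [a0+a1, b0+b1, a2+a3, b2+b3]
    __m256d h1 = _mm256_hadd_pd(c, d);                      // [c0+c1, d0+d1, c2+c3, d2+d3]
    __m256d swapped = _mm256_permute2f128_pd(h0, h1, 0x21); // [a2+a3, b2+b3, c0+c1, d0+d1]
    __m256d blended = _mm256_blend_pd(h0, h1, 0xC);         // [a0+a1, b0+b1, c2+c3, d2+d3]
    return _mm256_add_pd(swapped, blended);                 // [sum(a), sum(b), sum(c), sum(d)]
}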

DevX10
  • Does this answer your question? [SIMD optimization of a curve computed from the second derivative](https://stackoverflow.com/questions/47983660/simd-optimization-of-a-curve-computed-from-the-second-derivative) – Severin Pappadeux Mar 21 '20 at 17:46
  • Mmh, no, because there the final result is not in 4 variables that have to be summed up like in my case; they exploit the problem definition instead. Or did I miss something about how they do what I asked? – DevX10 Mar 21 '20 at 18:58
  • you can emulate hadd with 2 `vshufpd` + a vertical add, but that's exactly the same cost on some CPUs. (Although IIRC slightly cheaper on Zen1 where `hadd` is for some reason 4 uops per lane.) Transpose and sum is inherently not cheap, but shuffling vectors together to feed vertical adds is your best bet. Perhaps 2x vperm2f128 could be good, although that's slow on Zen1. – Peter Cordes Mar 21 '20 at 21:11
  • You need to transpose your matrix, then an ordinary SIMD "add" will work. – EOF Mar 21 '20 at 21:26
  • @PeterCordes could you elaborate on the 2x vperm2f128? I tried using that plus some blends and 3 adds, but I get the same performance as with hadds. Maybe using vperm2f128 the way you intend is faster. I would appreciate it, thanks – DevX10 Mar 21 '20 at 21:34

1 Answer


Three options:

  • Option 1: matrix transpose, then vertical sum

good: conceptually simple, using a generally useful algorithm (matrix transpose), portable code

bad: code size, latency, throughput

  • Option 2: using vhaddpd efficiently

good: small code (good for Icache), good latency and throughput on Intel uArchs

bad: requires architecture specific code, problematic on some uArch

  • Option 3: partial transpose, sum, partial transpose, sum

good: good latency, good throughput

bad: not quite as small as the vhaddpd-code, not as easy to understand as the full matrix transpose

Matrix Transpose, Vertical Sum

Have your compiler optimize this for you. With gcc vector extensions*, code to sum over a transposed matrix could look like this:

#include <stdint.h>

typedef uint64_t v4u64 __attribute__((vector_size(32)));
typedef double v4f64  __attribute__((vector_size(32)));

v4f64 dfoo(v4f64 sv0, v4f64 sv1, v4f64 sv2, v4f64 sv3)
{
  v4f64 tv[4];
  tv[0] = __builtin_shuffle(sv0, sv1, (v4u64){0,4,2,6});
  tv[1] = __builtin_shuffle(sv0, sv1, (v4u64){1,5,3,7});
  tv[2] = __builtin_shuffle(sv2, sv3, (v4u64){0,4,2,6});
  tv[3] = __builtin_shuffle(sv2, sv3, (v4u64){1,5,3,7});
  v4f64 fv[4];
  fv[0] = __builtin_shuffle(tv[0], tv[2], (v4u64){0,1,4,5});
  fv[1] = __builtin_shuffle(tv[0], tv[2], (v4u64){2,3,6,7});
  fv[2] = __builtin_shuffle(tv[1], tv[3], (v4u64){0,1,4,5});
  fv[3] = __builtin_shuffle(tv[1], tv[3], (v4u64){2,3,6,7});
  return fv[0]+fv[1]+fv[2]+fv[3];
}

gcc-9.2.1 produces the following assembly:

dfoo:
    vunpcklpd   %ymm3, %ymm2, %ymm5
    vunpcklpd   %ymm1, %ymm0, %ymm4
    vunpckhpd   %ymm1, %ymm0, %ymm0
    vinsertf128 $1, %xmm5, %ymm4, %ymm1
    vperm2f128  $49, %ymm5, %ymm4, %ymm4
    vunpckhpd   %ymm3, %ymm2, %ymm2
    vaddpd  %ymm4, %ymm1, %ymm1
    vinsertf128 $1, %xmm2, %ymm0, %ymm3
    vperm2f128  $49, %ymm2, %ymm0, %ymm0
    vaddpd  %ymm3, %ymm1, %ymm1
    vaddpd  %ymm0, %ymm1, %ymm0
    ret

Agner Fog's tables say:

  • vunpck[h/l]pd: 1 cycle latency, 1 per cycle throughput, 1 uOP port5.
  • vinsertf128: 3 cycles latency, 1 per cycle throughput, 1 uOP port5.
  • vperm2f128: 3 cycles latency, 1 per cycle throughput, 1 uOP port5.
  • vaddpd: 4 cycles latency, 2 per cycle throughput, 1 uOP port01.

In all, there are

  • 4 [unpack] + 2 [insert] + 2 [permute] = 8 port5 uOPs.
  • 3 [add] = 3 port01 uOPs.

Throughput will bottleneck on port5. Latency is pretty bad, at roughly 18 cycles. Code size is about 60 bytes.

Horizontal Sum

Code (sensibly) using vhadd is not easy to get via gcc vector extensions, so the code needs Intel-specific intrinsics:

v4f64 dfoo_hadd(v4f64 sv0, v4f64 sv1, v4f64 sv2, v4f64 sv3)
{
  v4f64 hv[2];
  hv[0] = __builtin_ia32_haddpd256(sv0, sv1); //[00+01, 10+11, 02+03, 12+13]
  hv[1] = __builtin_ia32_haddpd256(sv2, sv3); //[20+21, 30+31, 22+23, 32+33]
  v4f64 fv[2];
  fv[0] = __builtin_shuffle(hv[0], hv[1], (v4u64){0, 1, 4, 5}); //[00+01, 10+11, 20+21, 30+31]
  fv[1] = __builtin_shuffle(hv[0], hv[1], (v4u64){2, 3, 6, 7}); //[02+03, 12+13, 22+23, 32+33]
  return fv[0] + fv[1]; //[00+01+02+03, 10+11+12+13, 20+21+22+23, 30+31+32+33]
}

This generates the following assembly:

dfoo_hadd:
    vhaddpd %ymm3, %ymm2, %ymm2
    vhaddpd %ymm1, %ymm0, %ymm0
    vinsertf128 $1, %xmm2, %ymm0, %ymm1
    vperm2f128  $49, %ymm2, %ymm0, %ymm0
    vaddpd  %ymm0, %ymm1, %ymm0
    ret

According to Agner Fog's instruction tables,

  • vhaddpd: 6 cycles latency, 0.5 per cycle throughput, 3 uOPs (1 port01 + 2 port5).

In all, there are

  • 4 [hadd] + 2 [insert/permute] = 6 port5 uOPs.
  • 3 [hadd/add] = 3 port01 uOPs.

Throughput is also limited by port5, but this has better throughput than the transposing code. Latency should be roughly 16 cycles, also better than the transposing code. Code size is about 25 bytes.
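Since the question works with __m256d and Intel intrinsics rather than gcc vector extensions, the same hadd sequence could be written roughly like this (an untested sketch; the function name is just for illustration):

#include <immintrin.h>

// 2x vhaddpd, then combine the low and high 128-bit lanes and add.
__m256d dfoo_hadd_intrin(__m256d sv0, __m256d sv1, __m256d sv2, __m256d sv3)
{
  __m256d h0 = _mm256_hadd_pd(sv0, sv1); // [00+01, 10+11, 02+03, 12+13]
  __m256d h1 = _mm256_hadd_pd(sv2, sv3); // [20+21, 30+31, 22+23, 32+33]
  __m256d lo = _mm256_insertf128_pd(h0, _mm256_castpd256_pd128(h1), 1); // low lanes of h0, h1
  __m256d hi = _mm256_permute2f128_pd(h0, h1, 0x31);                    // high lanes of h0, h1
  return _mm256_add_pd(lo, hi);
}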

Partial Transpose, Sum, Partial Transpose, Sum

Implementing @PeterCordes's comment:

v4f64 dfoo_PC(v4f64 sv0, v4f64 sv1, v4f64 sv2, v4f64 sv3)
{
  v4f64 tv[4];
  v4f64 av[2];
  tv[0] = __builtin_shuffle(sv0, sv1, (v4u64){0,4,2,6});//[00, 10, 02, 12]
  tv[1] = __builtin_shuffle(sv0, sv1, (v4u64){1,5,3,7});//[01, 11, 03, 13]
  av[0] = tv[0] + tv[1];//[00+01, 10+11, 02+03, 12+13]
  tv[2] = __builtin_shuffle(sv2, sv3, (v4u64){0,4,2,6});//[20, 30, 22, 32]
  tv[3] = __builtin_shuffle(sv2, sv3, (v4u64){1,5,3,7});//[21, 31, 23, 33]
  av[1] = tv[2] + tv[3];//[20+21, 30+31, 22+23, 32+33]
  v4f64 fv[2];
  fv[0] = __builtin_shuffle(av[0], av[1], (v4u64){0,1,4,5});//[00+01, 10+11, 20+21, 30+31]
  fv[1] = __builtin_shuffle(av[0], av[1], (v4u64){2,3,6,7});//[02+03, 12+13, 22+23, 32+33]
  return fv[0]+fv[1];//[00+01+02+03, 10+11+12+13, 20+21+22+23, 30+31+32+33]
}

This generates:

dfoo_PC:
    vunpcklpd   %ymm1, %ymm0, %ymm4
    vunpckhpd   %ymm1, %ymm0, %ymm1
    vunpcklpd   %ymm3, %ymm2, %ymm0
    vunpckhpd   %ymm3, %ymm2, %ymm2
    vaddpd  %ymm1, %ymm4, %ymm1
    vaddpd  %ymm2, %ymm0, %ymm2
    vinsertf128 $1, %xmm2, %ymm1, %ymm0
    vperm2f128  $49, %ymm2, %ymm1, %ymm1
    vaddpd  %ymm1, %ymm0, %ymm0
    ret

In all, there are

  • 4 [unpack] + 2 [insert/permute] = 6 port5 uOPs.
  • 3 [add] = 3 port01 uOPs.

This gets down to the same number of port5 uOPs as the hadd-code. The code still bottlenecks on port5; latency is roughly 16 cycles. Code size is about 41 bytes.
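Again, for a codebase using Intel intrinsics instead of gcc vector extensions, this version could be sketched roughly as follows (untested; the name is illustrative):

#include <immintrin.h>

// Partial transpose (in-lane unpacks), add, then combine lanes and add again.
__m256d dfoo_PC_intrin(__m256d sv0, __m256d sv1, __m256d sv2, __m256d sv3)
{
  __m256d a0 = _mm256_add_pd(_mm256_unpacklo_pd(sv0, sv1),  // [00, 10, 02, 12]
                             _mm256_unpackhi_pd(sv0, sv1)); // [01, 11, 03, 13]
  __m256d a1 = _mm256_add_pd(_mm256_unpacklo_pd(sv2, sv3),  // [20, 30, 22, 32]
                             _mm256_unpackhi_pd(sv2, sv3)); // [21, 31, 23, 33]
  __m256d lo = _mm256_insertf128_pd(a0, _mm256_castpd256_pd128(a1), 1); // low lanes of a0, a1
  __m256d hi = _mm256_permute2f128_pd(a0, a1, 0x31);                    // high lanes of a0, a1
  return _mm256_add_pd(lo, hi);
}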

If you wanted to increase throughput, you would have to shift work away from port5. Unfortunately, almost all permute/insert/shuffle instructions require port5, and lane-crossing instructions (which are required here) have a minimum of 3 cycles latency. One interesting instruction that almost helps is vblendpd, which has 3/cycle throughput, 1 cycle latency, and can execute on ports 015. However, using it to replace one of the permutes/inserts/shuffles would require a 64-bit shift of a 128-bit lane of a vector, which is implemented by vpsrldq/vpslldq, and those (you guessed it) take a port5 uOP. (So this would help with vectors of 32-bit float, because vpsllq/vpsrlq do not require port5.) There is no free lunch here.

* gcc vector extensions quick description:

The code uses gcc vector extensions, which allow using basic operators (+ - * / = < > << >> etc.) on vectors, operating element-wise. The extensions also include a few __builtin_* functions, in particular __builtin_shuffle(). In its 3-operand form, the first two operands are vectors of the same type T and length N, which are (logically) concatenated into a double-length (2N) vector of that type; the third operand is a vector of an integer type with the same element width and length N as the original vectors. The result is a vector of type T and length N whose elements are selected from the concatenation by the indices in the integer vector.
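For example, with the typedefs from the code above, the {0,4,2,6} index vector used throughout this answer picks the even elements of both inputs:

// Using the v4f64/v4u64 typedefs defined earlier.
v4f64 shuffle_example(v4f64 a, v4f64 b)
{
  // Indices refer to the 8-element concatenation {a[0..3], b[0..3]}.
  return __builtin_shuffle(a, b, (v4u64){0, 4, 2, 6}); // {a[0], b[0], a[2], b[2]}
}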

Originally, my answer was about uint64_t, kept here for context:

 #include <stdint.h>

typedef uint64_t v4u64 __attribute__((vector_size(32)));

v4u64 foo(v4u64 sv0, v4u64 sv1, v4u64 sv2, v4u64 sv3)
{
  v4u64 tv[4];
  tv[0] = __builtin_shuffle(sv0, sv1, (v4u64){0,4,2,6});
  tv[1] = __builtin_shuffle(sv0, sv1, (v4u64){1,5,3,7});
  tv[2] = __builtin_shuffle(sv2, sv3, (v4u64){0,4,2,6});
  tv[3] = __builtin_shuffle(sv2, sv3, (v4u64){1,5,3,7});
  v4u64 fv[4];
  fv[0] = __builtin_shuffle(tv[0], tv[2], (v4u64){0,1,4,5});
  fv[1] = __builtin_shuffle(tv[0], tv[2], (v4u64){2,3,6,7});
  fv[2] = __builtin_shuffle(tv[1], tv[3], (v4u64){0,1,4,5});
  fv[3] = __builtin_shuffle(tv[1], tv[3], (v4u64){2,3,6,7});
  return fv[0]+fv[1]+fv[2]+fv[3];
}

The translation generated by gcc-9.2.1 targeting Skylake with AVX2 could look like this:

foo:
    vpunpcklqdq %ymm3, %ymm2, %ymm5
    vpunpcklqdq %ymm1, %ymm0, %ymm4
    vpunpckhqdq %ymm3, %ymm2, %ymm2
    vpunpckhqdq %ymm1, %ymm0, %ymm0
    vperm2i128  $32, %ymm2, %ymm0, %ymm3
    vperm2i128  $32, %ymm5, %ymm4, %ymm1
    vperm2i128  $49, %ymm2, %ymm0, %ymm0
    vperm2i128  $49, %ymm5, %ymm4, %ymm4
    vpaddq  %ymm4, %ymm1, %ymm1
    vpaddq  %ymm0, %ymm3, %ymm0
    vpaddq  %ymm0, %ymm1, %ymm0
    ret

Note that the assembly corresponds almost line for line to the gcc vector extension code.

According to Agner Fog's instruction tables for Skylake,

  • vpunpck[h/l]qdq: 1 cycle latency, 1 per cycle throughput, port5.
  • vperm2i128: 3 cycles latency, 1 per cycle throughput, port5.
  • vpaddq: 1 cycle latency, 3 per cycle throughput, ports015.

So the transpose takes about 10 cycles (4 for the unpacks, plus 4 cycles of throughput and 2 of latency for the permutes). Of the three adds, only two can execute in parallel, so the adds take 2 more cycles, for about 12 cycles in total (a latency estimate, not throughput).

EOF
  • `__m256d` is double, not int64_t. FP versions of everything are available with the same performance, except FP add is 0.5c throughput, 4c latency on SKL. – Peter Cordes Mar 21 '20 at 23:34
  • Your perf analysis should state more clearly that "takes 12 cycles" is latency, not throughput. If you did this back-to-back for *independent* inputs, you could do one iteration (of 4 vectors -> 1) per 8 clocks, only bottlenecked on shuffle throughput. With plenty of front-end bandwidth left over for loading/storing results. – Peter Cordes Mar 21 '20 at 23:39
  • And BTW, is there a way we can `hadd` for the first step, reducing to only 2 vectors? Or does that complicate the shuffling such that you'd need AVX512 `vpermt2pd` or something to line up the right elements across lanes for a final vertical add, or an extra lane-crossing shuffle like AVX2 `vpermpd` after `vperm2f128`? – Peter Cordes Mar 21 '20 at 23:43
  • I forgot to mention that I work with doubles, so `vpunpcklqdq` wouldn't be good. I'm probably going to use `vunpcklpd`, but throughput is worse. Also, what does the first block of code represent? I cannot understand how the permutations are performed exactly – DevX10 Mar 21 '20 at 23:47
  • @DevX10: You should use the version that compiles to 2x `vhaddpd` to feed vinsertf128 + vperm2f128 + `vaddpd`. That looks very good. EOF I think got mixed up with which types you wanted, but that's fixed in an update which also optimizes by doing the first step of adding before the 2nd step of shuffling, resulting in less total shuffling being needed. (Fun fact: `vpunpcklqdq` on `double` vectors is not actually a problem on Intel CPUs. But it has no advantage either so yes you'd prefer FP shuffles.) – Peter Cordes Mar 22 '20 at 09:44
  • @EOF: Actually using `haddpd` is not super important, and in fact a loss on some AMD CPUs. Using vector extensions but still doing one step of adding before the lane-crossing shuffling would still get you down to 6 shuffle uops like your hadd version, instead of 8 like the worse version you present first. – Peter Cordes Mar 22 '20 at 09:47
  • @PeterCordes Good point. Added a version that adds in-between the shuffling, matching the `hadd` port5 utilization, thanks. – EOF Mar 22 '20 at 10:11
  • Nice, but you might want to put the best versions first for the benefit of future readers. Or a TL:DR section that says which version people should actually use. – Peter Cordes Mar 22 '20 at 10:16
  • @PeterCordes I'm estimating the newest code to have about the same latency as the `hadd`-code on skylake. I suppose if I wanted to give a recommendation, I'd have to also make throughput/latency estimates for other uArchs, which I'm much less familiar with (and estimating latency is pretty difficult for me). – EOF Mar 22 '20 at 10:56
  • Yeah, `hadd` on SKL is just a compact way (in machine code) to encode the same 3 uops you're using manually (2 shuffle + 1 vertical). On some AMD CPUs `hadd` is more uops than you'd expect, vperm2f128 is also a lot more so if you really wanted to tune for AMD Zen1 you'd decompose it to `vextractf128` + `vinsertf128`. For generic you'd probably recommend either the hadd or the equivalent separate-shuffle version. On Intel there should be no difference in their critical-path latency, and if you're using SIMD for data-parallel problems then hopefully you bottleneck on throughput not latency. – Peter Cordes Mar 22 '20 at 11:23
  • @PeterCordes Just had a look at Agner Fog's tables for Zen/Ryzen, remembered that AMD doesn't have 256-bit vector ALUs. I wouldn't be surprised if on AMD you'd be better off splitting the whole matrix into 8x2 doubles. Anyway, I'm unlikely to do *yet another* dive into uArch specifics and code *yet another* alternative for a uArch I don't even have access to. – EOF Mar 22 '20 at 11:30
  • @EOF: Zen2 has 256-bit vector ALUs; only AMD *before that* splits everything into 128-bit uops (and makes lane-crossing shuffles extra expensive, except insert/extract128 are super cheap). I haven't looked at shuffles on Zen2 in detail, but https://uops.info/ has the full test results. Your answer is fine; perf analysis for Skylake is more than enough without getting into Zen1 vs. Zen2, and which shuffles IceLake can run on its extra shuffle unit that's not on port 5. (Yes it has better than 1c throughput for some shuffles, IIRC some common in-lane ones.) – Peter Cordes Mar 22 '20 at 12:19