3

While crawling through the available intrinsics, I've noticed there's no horizontal addsub/subadd instruction anywhere to be seen. There is one in the obsolete 3DNow! extension, but its use is impractical for obvious reasons. What is the reason for such a "basic" operation not being implemented in the SSE3 extension alongside the similar horizontal and addsub operations?

And by the way, what's the fastest alternative on a modern instruction set (SSE3, SSE4, AVX, ...), with 2 doubles per value, i.e. __m128d?

user2054583
  • I wouldn't really consider any of the horizontal instructions to be "basic". They all break the paradigm of SIMD, which is why they have been getting slower and slower on newer processors and were left out of AVX512 completely. – Mysticial Feb 13 '18 at 16:25
  • Which platform? Which compiler? Please tag this question appropriately. – mlp Feb 13 '18 at 16:34
  • @mlp: It already is tagged appropriately: x86 (some version of SSE) with Intel's intrinsics, which are supported across all the major compilers. This question is simple enough to consider it for both Intel and AMD microarchitectures, especially because there's nothing different about horizontal ops on AMD vs. Intel, so we don't need to narrow it down to just Skylake or just Ryzen or something. – Peter Cordes Feb 13 '18 at 17:18
  • There's already a good answer here about the rationale for not having a lot of horizontal operations, but you might want to take a look at [this blog post](https://blogs.msdn.microsoft.com/chuckw/2012/09/11/directxmath-sse3-and-ssse3/). – Chuck Walbourn Feb 14 '18 at 17:07
  • @ChuckWalbourn: that blog post claims it's a good idea to use `_mm_hadd_ps` twice for a horizontal sum. That's not the case if you want high throughput; see https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86. The useful SSE3 instruction is `_mm_movehdup_ps`, which gives you an FP copy-and-shuffle so the compiler doesn't need a `movaps / shufps` even when you compile without AVX. – Peter Cordes Mar 06 '18 at 21:18
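
As an illustration of the `_mm_movehdup_ps` idiom mentioned in the comment above, here is a minimal horizontal-sum sketch (assuming SSE3; the helper name is made up for this example):

#include <pmmintrin.h>   // SSE3, for _mm_movehdup_ps

// Horizontal sum of 4 floats; movehdup is a copy-and-shuffle, so no separate
// movaps is needed even when compiling without AVX.
static inline float hsum_ps_sse3(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);    // [v3, v3, v1, v1]
    __m128 sums = _mm_add_ps(v, shuf);   // element 0 = v0+v1, element 2 = v2+v3
    shuf = _mm_movehl_ps(shuf, sums);    // bring v2+v3 down to element 0
    sums = _mm_add_ss(sums, shuf);       // (v0+v1) + (v2+v3)
    return _mm_cvtss_f32(sums);
}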

2 Answers

6

Generally you want to avoid designing your code to use horizontal ops in the first place; try to do the same thing to multiple data in parallel, instead of different things with different elements. But sometimes a local optimization is still worth it, and horizontal stuff can be better than pure scalar.

Intel experimented with adding horizontal ops in SSE3, but never added dedicated hardware to support them. They decode to 2 shuffles + 1 vertical op on all CPUs that support them (including AMD). See Agner Fog's instruction tables. More recent ISA extensions have mostly not included more horizontal ops, except for SSE4.1 dpps/dppd (which is also usually not worth using vs. manually shuffling).
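
To make that cost concrete, here is roughly what haddpd does internally, written out with intrinsics (a sketch only: the exact shuffle choice and uop breakdown vary by CPU, and the function name is made up here):

#include <emmintrin.h>   // SSE2

// Two shuffles to pair up the elements, then one vertical addpd:
// the same result as _mm_hadd_pd(a, b).
static inline __m128d haddpd_manual(__m128d a, __m128d b) {
    __m128d lows  = _mm_unpacklo_pd(a, b);   // [b0, a0]
    __m128d highs = _mm_unpackhi_pd(a, b);   // [b1, a1]
    return _mm_add_pd(lows, highs);          // [b1+b0, a1+a0]
}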

SSSE3 pmaddubsw makes sense because element-width is already a problem for widening multiplication, and SSE4.1 phminposuw got dedicated HW support right away to make it worth using (and doing the same thing without it would cost a lot of uops, and it's specifically very useful for video encoding). But AVX / AVX2 / AVX512 horizontal ops are very scarce. AVX512 did introduce some nice shuffles, so you can build your own horizontal ops out of the powerful 2-input lane-crossing shuffles if needed.
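
For example, here is a hedged sketch (assuming AVX-512F; the function name and index constants are just for illustration) of a haddpd-style operation on two __m512d inputs, built from the 2-input lane-crossing shuffle vpermt2pd plus one vertical add; note the output element order differs from the in-lane order of the SSE3 instruction:

#include <immintrin.h>   // AVX-512F

static inline __m512d hadd512_pd(__m512d a, __m512d b) {
    const __m512i even_idx = _mm512_setr_epi64(0, 2, 4, 6, 8, 10, 12, 14);
    const __m512i odd_idx  = _mm512_setr_epi64(1, 3, 5, 7, 9, 11, 13, 15);
    __m512d evens = _mm512_permutex2var_pd(a, even_idx, b);  // a0,a2,a4,a6,b0,b2,b4,b6
    __m512d odds  = _mm512_permutex2var_pd(a, odd_idx,  b);  // a1,a3,a5,a7,b1,b3,b5,b7
    return _mm512_add_pd(evens, odds);   // pairwise sums: 4 from a, then 4 from b
}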


If the most efficient solution to your problem already includes shuffling together two inputs two different ways and feeding that to an add or sub, then sure, haddpd is an efficient way to encode that; especially without AVX, where preparing the inputs might also have required a movaps instruction because shufpd is destructive (the compiler emits that movaps silently when you use intrinsics, but it still costs front-end bandwidth, and latency on CPUs like Sandybridge and earlier which don't eliminate reg-reg moves).

But if you were going to use the same input twice, haddpd is the wrong choice. See also Fastest way to do horizontal float vector sum on x86. hadd / hsub are only a good idea with two different inputs, e.g. as part of an on-the-fly transpose combined with some other operation on a matrix.
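
As a small example of that two-different-inputs case (the function name is invented for this sketch), one haddpd gives the row sums of two __m128d rows at once:

#include <pmmintrin.h>   // SSE3

// Low element = row0[1] + row0[0], high element = row1[1] + row1[0]:
// one instruction instead of two shuffles + addpd written by hand.
static inline __m128d row_sums_2x2(__m128d row0, __m128d row1) {
    return _mm_hadd_pd(row0, row1);
}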


Anyway, the point is, build your own haddsub_pd if you want it, out of two shuffles + SSE3 addsubpd (which does have single-uop hardware support on CPUs that support it.) With AVX, it will be just as fast as a hypothetical haddsubpd instruction, and without AVX will typically cost one extra movaps because the compiler needs to preserve both inputs to the first shuffle. (Code-size will be bigger, but I'm talking about cost in uops for the front-end, and execution-port pressure for the back-end.)

// Requires SSE3 (for addsubpd)
// inputs: a=[a1 a0]  b=[b1 b0]
// output: [b1+b0, a1-a0],  like haddpd for b and hsubpd for a
static inline
__m128d haddsub_pd(__m128d a, __m128d b) {
    __m128d lows  = _mm_unpacklo_pd(a,b);  // [b0,    a0]
    __m128d highs = _mm_unpackhi_pd(a,b);  // [b1,    a1]
    return _mm_addsub_pd(highs, lows);     // [b1+b0, a1-a0]
}

With gcc -msse3 and clang (on Godbolt) we get the expected:

    movapd  xmm2, xmm0          # ICC saves a code byte here with movaps, but gcc/clang use movapd on double vectors for no advantage on any CPU.
    unpckhpd        xmm0, xmm1
    unpcklpd        xmm2, xmm1
    addsubpd        xmm0, xmm2
    ret

This wouldn't typically matter when inlining, but as a stand-alone function gcc and clang have trouble when they need the return value in the same register that b starts in, instead of a (e.g. if the args are reversed so it's haddsub(b,a)).

# gcc for  haddsub_pd_reverseargs(__m128d b, __m128d a) 
    movapd  xmm2, xmm1          # copy b
    unpckhpd        xmm1, xmm0
    unpcklpd        xmm2, xmm0
    movapd  xmm0, xmm1          # extra copy to put the result in the right register
    addsubpd        xmm0, xmm2
    ret

clang actually does a better job, using a different shuffle (movhlps instead of unpckhpd) to still only use one register-copy:

# clang5.0
    movapd  xmm2, xmm1              # clang's comments go in least-significant-element-first order, unlike my comments in the source, which follow Intel's convention in docs / diagrams / set_pd() args order
    unpcklpd        xmm2, xmm0      # xmm2 = xmm2[0],xmm0[0]
    movhlps xmm0, xmm1              # xmm0 = xmm1[1],xmm0[1]
    addsubpd        xmm0, xmm2
    ret

For an AVX version with __m256d vectors, the in-lane behaviour of _mm256_unpacklo/hi_pd is actually what you want, for once, to get the even / odd elements.

static inline
__m256d haddsub256_pd(__m256d b, __m256d a) {
    __m256d lows  = _mm256_unpacklo_pd(a,b);  // [b2, a2 | b0, a0]
    __m256d highs = _mm256_unpackhi_pd(a,b);  // [b3, a3 | b1, a1]
    return _mm256_addsub_pd(highs, lows);     // [b3+b2, a3-a2 | b1+b0, a1-a0]
}

# clang and gcc both have an easy time avoiding wasted mov instructions
    vunpcklpd       ymm2, ymm1, ymm0 # ymm2 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
    vunpckhpd       ymm0, ymm1, ymm0 # ymm0 = ymm1[1],ymm0[1],ymm1[3],ymm0[3]
    vaddsubpd       ymm0, ymm0, ymm2
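
For reference, a quick usage sketch (assuming the haddsub256_pd helper above is in scope and the code is compiled with AVX enabled, e.g. -mavx):

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    // setr_pd takes elements lowest-first, so a = [a3 a2 a1 a0] = [4 3 2 1]
    __m256d a = _mm256_setr_pd(1.0, 2.0, 3.0, 4.0);
    __m256d b = _mm256_setr_pd(10.0, 20.0, 30.0, 40.0);
    double r[4];
    _mm256_storeu_pd(r, haddsub256_pd(b, a));   // first param is the add input, second the sub input
    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);  // prints: 1 30 1 70
    return 0;
}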

Of course, if you have the same input twice, i.e. you want the sum and difference of the two elements of a single vector, you only need one shuffle to feed addsubpd:

// returns [a1+a0  a1-a0]
static inline
__m128d sumdiff(__m128d a) {
    __m128d swapped = _mm_shuffle_pd(a,a, 0b01);
    return _mm_addsub_pd(swapped, a);
}

This actually compiles quite clunkily with both gcc and clang:

    movapd  xmm1, xmm0
    shufpd  xmm1, xmm0, 1
    addsubpd        xmm1, xmm0
    movapd  xmm0, xmm1
    ret

But the 2nd movapd should go away when inlining, if the compiler doesn't need the result in the same register it started with. I think gcc and clang are both missing an optimization here: they could swap xmm0 after copying it:

     # compilers should do this, but don't
    movapd  xmm1, xmm0         # a = xmm1 now
    shufpd  xmm0, xmm0, 1      # swapped = xmm0
    addsubpd xmm0, xmm1        # swapped +- a
    ret

Presumably their SSA-based register allocators don't think of using a 2nd register for the same value of a to free up xmm0 for swapped. Usually it's fine (and even preferable) to produce the result in a different register, so this is rarely a problem when inlining; it only shows up when looking at the stand-alone version of a function.

Peter Cordes
  • Nice and complete answer; it seems the best option is to change the design to avoid horizontal instructions. – user2054583 Feb 15 '18 at 00:00
2

How about:

__m128d a, b; // your inputs

const __m128d signflip_low_element =
        _mm_castsi128_pd(_mm_set_epi64x(0, 0x8000000000000000));
b = _mm_xor_pd(b, signflip_low_element);  // negate b[0]
__m128d res = _mm_hadd_pd(a, b);

This builds haddsubpd in terms of haddpd, so it's only one extra instruction. Unfortunately haddpd is not very fast, with a throughput of one per 2 clock cycles on most CPUs, limited by FP shuffle throughput.

But this way is good for code-size (of the x86 machine code).
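
Tying in the suggestion from the comments below, the sign-flip constant can be written more simply with _mm_set_pd, and the whole thing wrapped up as a function (a sketch; the function name is made up here):

#include <pmmintrin.h>   // SSE3, for _mm_hadd_pd

// [b1-b0, a1+a0]: flip the sign of b[0], then let haddpd do both horizontal ops.
static inline __m128d haddsub_via_haddpd(__m128d a, __m128d b) {
    const __m128d signflip_low = _mm_set_pd(0.0, -0.0);   // sign bit in the low element only
    return _mm_hadd_pd(a, _mm_xor_pd(b, signflip_low));
}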

C_Elegans
  • `haddpd` costs 2 shuffles + an `addpd` on all CPUs that support it. Probably better to use one shuffle to feed SSE3 `addsubpd`, especially with AVX where you don't need any weird tricks (like `pshufd` on FP data) to copy+shuffle and avoid a `movaps`. – Peter Cordes Feb 13 '18 at 14:11
  • Slightly cleaner, clearer and more portable than `_mm_set_epi64x(0,0x8000000000000000)` is `_mm_set_pd(0.0, -0.0)`. – Paul R Feb 13 '18 at 14:57
  • @PaulR: re: the changelog on your edit: `_mm_xor_pd` is faster on Skylake than `_mm_xor_ps`? Of course the old version of the code wouldn't compile, but the asm equivalent (using xorps between double-precision FP instructions) should be totally equivalent to xorpd, but save one instruction byte for the non-AVX version. I'm not aware of any CPU where PS / PD booleans are different from each other in any way. Of course, if you cared about faster, you wouldn't be using `haddpd` in the first place. – Peter Cordes Feb 13 '18 at 15:04
  • @PeterCordes: it may be an error in the Intel intrinsics guide, but it says throughput = 1 for xorps and 0.33 for xorpd. (Skylake only). – Paul R Feb 13 '18 at 15:07
  • @PaulR: That's an error. They're all 0.33c throughput, like the integer version, not limited to port5 only like on Broadwell. I assume the ps and pd versions decode to the *same* internal uop. – Peter Cordes Feb 13 '18 at 15:24
  • @PeterCordes: yes, I thought it might be. Interestingly though, I notice that clang 5.0 compiles `_mm_xor_pd` to the integer version, `pxor`. (It also compiles `_mm_xor_ps` to `xorpd`, which is even stranger). (This applies to the OP's code - see godbolt link above - for anything more straightforward it behaves as expected.) – Paul R Feb 13 '18 at 15:26
  • @PaulR: `pxord` doesn't exist, do you mean AVX512 `vpxord`? Note that AVX512F only includes the integer booleans, so [EVEX `xorps` requires AVX512DQ](https://github.com/HJLebbink/asm-dude/wiki/XORPS). Of course, using the longer EVEX encoding is silly when you don't need it, but OTOH xmm/ymm/zmm16-31 can be used without needing `vzeroupper`, without any worry about SSE/AVX transition penalties, because legacy SSE can't touch the low lane of those registers. Oh NVM, you meant `xorpd`. Yes, that's a pure pessimisation to use a longer instruction (but neutral with the VEX version of both.) – Peter Cordes Feb 13 '18 at 15:31
  • @PeterCordes: yes, my bad - the OP was compiling with `-mavx` and so these were VEX instructions. Actually though, the Intel Instrinsics Guide lists `xorpd` as an SSE2 instruction - is it just an alias for `xorps` ? – Paul R Feb 13 '18 at 15:32
  • @PaulR: They are separate instructions in x86 machine code (with `xorpd` requiring an additional prefix, as usual for SSE1 `ps` vs. SSE2 `pd`), but they do exactly the same thing, so the decoders could (if they wanted to) decode them to the same internal uop. – Peter Cordes Feb 13 '18 at 15:55
  • @PaulR They are separate instructions with different decodings, but internally they're all the same on Skylake. Not even a difference in bypass penalty. Same applies to all the shuffle variants. – Mysticial Feb 13 '18 at 16:27
  • @Mysticial: so does that mean they are all 0.33 throughput on Skylake ? I notice that clang 5.0 compiles all three variants to `xorps` regardless of the operand type. – Paul R Feb 13 '18 at 16:29
  • @PaulR For `xmm` and `ymm` operands, yes it's 0.33 throughput. For `zmm`, it's 0.5 since port1 is not available to SIMD in 512-bit mode. – Mysticial Feb 13 '18 at 16:31