Generally you want to avoid designing your code to use horizontal ops in the first place; try to do the same thing to multiple data in parallel, instead of different things with different elements. But sometimes a local optimization is still worth it, and horizontal stuff can be better than pure scalar.
Intel experimented with adding horizontal ops in SSE3, but never added dedicated hardware to support them. They decode to 2 shuffles + 1 vertical op on all CPUs that support them (including AMD); see Agner Fog's instruction tables. More recent ISA extensions have mostly not included more horizontal ops, except for SSE4.1 dpps / dppd (which is also usually not worth using vs. manually shuffling).
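For example, a 2-element double dot product only needs a vertical multiply, one shuffle, and a scalar add; something like this sketch (the function name dot2 is just mine; needs <emmintrin.h>, SSE2 only):

// dot product of two 2-element double vectors without dppd
static inline
double dot2(__m128d a, __m128d b) {
__m128d prod = _mm_mul_pd(a, b);             // [a1*b1, a0*b0]
__m128d hi   = _mm_unpackhi_pd(prod, prod);  // broadcast the high element
return _mm_cvtsd_f64(_mm_add_sd(prod, hi));  // a0*b0 + a1*b1
}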
SSSE3 pmaddubsw makes sense because element-width is already a problem for widening multiplication, and SSE4.1 phminposuw got dedicated HW support right away to make it worth using (doing the same thing without it would cost a lot of uops, and it's specifically very useful for video encoding). But AVX / AVX2 / AVX512 horizontal ops are very scarce. AVX512 did introduce some nice shuffles, so you can build your own horizontal ops out of the powerful 2-input lane-crossing shuffles if needed, as sketched below.
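For instance, something like this sketch (my own naming, needs <immintrin.h> and AVX512F) builds a full-width pairwise add of two __m512d inputs out of two vpermt2pd lane-crossing shuffles plus one vertical add. Unlike haddpd / vhaddpd, it packs the results across the whole vector instead of within each 128-bit lane:

// hypothetical lane-crossing "haddpd" for 512-bit vectors
static inline
__m512d hadd512_pd(__m512d a, __m512d b) {
// vpermt2pd index: 0..7 selects from a, 8..15 selects from b
const __m512i idx_even = _mm512_set_epi64(14,12,10,8, 6,4,2,0);
const __m512i idx_odd  = _mm512_set_epi64(15,13,11,9, 7,5,3,1);
__m512d evens = _mm512_permutex2var_pd(a, idx_even, b);  // [b6,b4,b2,b0 | a6,a4,a2,a0]
__m512d odds  = _mm512_permutex2var_pd(a, idx_odd,  b);  // [b7,b5,b3,b1 | a7,a5,a3,a1]
return _mm512_add_pd(odds, evens);   // [b7+b6 .. b1+b0 | a7+a6 .. a1+a0]
}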
If the most efficient solution to your problem already includes shuffling together two inputs two different ways and feeding that to an add or sub, then sure, haddpd is an efficient way to encode that; especially without AVX, where preparing the inputs might have required a movaps instruction as well because shufpd is destructive (silently emitted by the compiler when using intrinsics, but still costing front-end bandwidth, and latency on CPUs like Sandybridge and earlier which don't eliminate reg-reg moves).
But if you were going to use the same input twice, haddpd is the wrong choice. See also Fastest way to do horizontal float vector sum on x86. hadd / hsub are only a good idea with two different inputs, e.g. as part of an on-the-fly transpose within some other operation on a matrix, as in the sketch below.
Anyway, the point is: build your own haddsub_pd if you want it, out of two shuffles + SSE3 addsubpd (which does have single-uop hardware support on CPUs that support it). With AVX, it will be just as fast as a hypothetical haddsubpd instruction, and without AVX will typically cost one extra movaps because the compiler needs to preserve both inputs to the first shuffle. (Code-size will be bigger, but I'm talking about cost in uops for the front-end, and execution-port pressure for the back-end.)
// Requires SSE3 (for addsubpd)
// inputs: a=[a1 a0] b=[b1 b0]
// output: [b1+b0, a1-a0]: haddpd for b, and hsubpd for a but with reversed operands (a1-a0, not a0-a1)
static inline
__m128d haddsub_pd(__m128d a, __m128d b) {
__m128d lows = _mm_unpacklo_pd(a,b); // [b0, a0]
__m128d highs = _mm_unpackhi_pd(a,b); // [b1, a1]
return _mm_addsub_pd(highs, lows); // [b1+b0, a1-a0]
}
With gcc -msse3 and clang (on Godbolt) we get the expected:
movapd xmm2, xmm0 # ICC saves a code byte here with movaps, but gcc/clang use movapd on double vectors for no advantage on any CPU.
unpckhpd xmm0, xmm1
unpcklpd xmm2, xmm1
addsubpd xmm0, xmm2
ret
This wouldn't typically matter when inlining, but as a stand-alone function gcc and clang have trouble when they need the return value in the same register that b starts in, instead of a (e.g. if the args are reversed so it's haddsub(b,a)).
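Compiling a stand-alone wrapper like the following (same body, parameter order swapped; the name just matches the asm comment below) shows the problem:

// same body as haddsub_pd, but the parameter order is swapped, so the result
// has to end up in the register b arrives in (xmm0 in the x86-64 System V convention)
static inline
__m128d haddsub_pd_reverseargs(__m128d b, __m128d a) {
__m128d lows = _mm_unpacklo_pd(a,b);   // [b0, a0]
__m128d highs = _mm_unpackhi_pd(a,b);  // [b1, a1]
return _mm_addsub_pd(highs, lows);     // [b1+b0, a1-a0]
}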
# gcc for haddsub_pd_reverseargs(__m128d b, __m128d a)
movapd xmm2, xmm1 # copy a (the 2nd arg, in xmm1)
unpckhpd xmm1, xmm0
unpcklpd xmm2, xmm0
movapd xmm0, xmm1 # extra copy to put the result in the right register
addsubpd xmm0, xmm2
ret
clang actually does a better job, using a different shuffle (movhlps instead of unpckhpd) so it still only needs one register-copy:
# clang5.0
movapd xmm2, xmm1 # clang's comments go in least-significant-element-first order, unlike my comments in the source which follow Intel's convention in docs / diagrams / set_pd() args order
unpcklpd xmm2, xmm0 # xmm2 = xmm2[0],xmm0[0]
movhlps xmm0, xmm1 # xmm0 = xmm1[1],xmm0[1]
addsubpd xmm0, xmm2
ret
For an AVX version with __m256d vectors, the in-lane behaviour of _mm256_unpacklo/hi_pd is actually what you want, for once, to get the even / odd elements.
static inline
__m256d haddsub256_pd(__m256d b, __m256d a) {
__m256d lows = _mm256_unpacklo_pd(a,b); // [b2, a2 | b0, a0]
__m256d highs = _mm256_unpackhi_pd(a,b); // [b3, a3 | b1, a1]
return _mm256_addsub_pd(highs, lows); // [b3+b2, a3-a2 | b1+b0, a1-a0]
}
# clang and gcc both have an easy time avoiding wasted mov instructions
vunpcklpd ymm2, ymm1, ymm0 # ymm2 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
vunpckhpd ymm0, ymm1, ymm0 # ymm0 = ymm1[1],ymm0[1],ymm1[3],ymm0[3]
vaddsubpd ymm0, ymm0, ymm2
Of course, if you have the same input twice, i.e. you want the sum and difference between the two elements of a vector, you only need one shuffle to feed addsubpd:
// returns [a1+a0 a1-a0]
static inline
__m128d sumdiff(__m128d a) {
__m128d swapped = _mm_shuffle_pd(a,a, 0b01);
return _mm_addsub_pd(swapped, a);
}
This actually compiles quite clunkily with both gcc and clang:
movapd xmm1, xmm0
shufpd xmm1, xmm0, 1
addsubpd xmm1, xmm0
movapd xmm0, xmm1
ret
But the 2nd movapd should go away when inlining, if the compiler doesn't need the result in the same register it started with. I think gcc and clang are both missing an optimization here: they could swap xmm0 after copying it:
# compilers should do this, but don't
movapd xmm1, xmm0 # a = xmm1 now
shufpd xmm0, xmm0, 1 # swapped = xmm0
addsubpd xmm0, xmm1 # swapped +- a
ret
Presumably their SSA-based register allocators don't think of using a 2nd register for the same value of a to free up xmm0 for swapped. Usually it's fine (and even preferable) to produce the result in a different register, so this is rarely a problem when inlining, only when looking at the stand-alone version of a function.