The vhaddps
instruction adds in a very peculiar way:
Source: https://www.felixcloutier.com/x86/haddps
What is the reason for this? What use cases is this instruction made for? It looks like the design has something specific in mind.
It's 2 in-lane haddps
instructions in the low and high 128-bit lanes. Most AVX instructions don't really widen the operation to 256-bit, they do 2 separate in-lane operations. This makes AVX hard to use, especially without AVX2 for lane-crossing shuffles with less than 128-bit granularity!
But it saves transistors vs. e.g. making vpshufb
a single 32-byte shuffle instead of 2x 16-byte shuffles. AVX2 doesn't even provide that: Where is VPERMB in AVX2? (Have to wait for AVX512VBMI).
(related: best way to shuffle across AVX lanes? Also, AVX512 adds a lot of flexible lane-crossing shuffles, but the AVX512 versions of SSE/AVX instructions like vhaddps zmm
are still in-lane. See also Do 128bit cross lane operations in AVX512 give better performance?)
A chain of AVX2 vpack*
typically needs a vpermq
to do a lane-crossing fixup at the end, unless you're going to unpack in-lane again. So in most cases, 2x in-lane shuffles are worse than a full 256-bit wide operation, but that's not what we get from AVX. There's often still a speedup from going up from 128-bit to 256-bit vectors even when extra shuffles are needed to correct for the in-lane behaviour, but it's then often less than a 2x speedup even with no memory bottlenecks.
vpalignr
is probably the most egregious example of 2x 128-bit versions of the same shuffle not being a useful building block on its own; I can't remember ever seeing a use-case for taking 2 separate in-lane byte windows of data. Oh, actually yes, if you feed it with vperm2i128
How to concatenate two vector efficiently using AVX2? (a lane-crossing version of VPALIGNR) but usually unaligned loads are better on CPUs that support AVX2.
The use-cases for (v)haddps
are very limited. Maybe Intel planned to make haddps
into a single-uop instruction at some point after introducing it with SSE3, but that never happened.
The use-cases include transpose-and-add type things where you'd need to shuffle both inputs for a vertical addps
anyway. e.g. Most efficient way to get a __m256 of horizontal sums of 8 source __m256 vectors includes vhaddps
. (Plus AVX1 vperm2f128
to correct for the in-lane behaviour.)
Many people mistakenly think it's good for horizontal sums of a single vector, but both 128 and 256-bit (v)haddps
decode to 2x shuffle uops to prepare input vectors for a vertical (v)addps
uop. For a horizontal sum you only need 1 shuffle uop per add. (Fastest way to do horizontal float vector sum on x86)
Narrowing to 128-bit first (with vextractf128
/ vaddps
) is usually a better first step unless you want the result broadcast to every element, and you're not on an AMD CPU (where 256-bit vector operations decode to at least 2 uops, or more for lane-crossing shuffles). (v)haddps xmm
or integer vphaddd
are useful for horizontal sums if you're optimizing for code-size not speed, e.g. my x86 machine-code answer on the code-golf question "Calculate the Mean mean of two numbers".
AVX's non-destructive destination operands also remove some of the appeal of having a multi-uop instruction. Without AVX, sometimes you can't avoid a movaps
to copy a register before destroying it, so baking 2x shuffle + add into 1 instruction did actually save uops vs. having to do that manually with movaps
+ shufps
.
As with many 256-bit wide instructions, the upper 128 bits
of vhaddps ymm, ymm, ymm
are just a copy-paste of the 128-bit wide vhaddps xmm, xmm, xmm
instruction. The following example shows why it makes sense to
define vhaddps xmm, xmm, xmm
in such an involved way: chaining this instruction in two steps
gives you the horizontal sums of 4 xmm
registers.
/* gcc -m64 -O3 hadd_ex.c -march=sandybridge */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float tmp[4];
    /* _mm_set_ps lists elements from high to low */
    __m128 a = _mm_set_ps(1.0, 2.0, 3.0, 4.0);
    __m128 b = _mm_set_ps(10.0, 20.0, 30.0, 40.0);
    __m128 c = _mm_set_ps(100.0, 200.0, 300.0, 400.0);
    __m128 d = _mm_set_ps(1000.0, 2000.0, 3000.0, 4000.0);
    __m128 sum1 = _mm_hadd_ps(a, b);       /* pairwise sums of a and b */
    __m128 sum2 = _mm_hadd_ps(c, d);       /* pairwise sums of c and d */
    __m128 sum  = _mm_hadd_ps(sum1, sum2); /* full sums of a, b, c, d  */
    _mm_storeu_ps(tmp, sum);
    printf("sum = %f %f %f %f\n", tmp[0], tmp[1], tmp[2], tmp[3]);
    return 0;
}
Output:
sum = 10.000000 100.000000 1000.000000 10000.000000