The vhaddps
instruction adds in a very peculiar way:
Source: https://www.felixcloutier.com/x86/haddps
What is the reason for this? What use cases is this instruction made for? It looks like the design has something specific in mind.
It's 2 in-lane haddps
instructions in the low and high 128-bit lanes. Most AVX instructions don't really widen the operation to 256-bit, they do 2 separate in-lane operations. This makes AVX hard to use, especially without AVX2 for lane-crossing shuffles with less than 128-bit granularity!
But it saves transistors vs. e.g. making vpshufb
a single 32-byte shuffle instead of 2x 16-byte shuffles. AVX2 doesn't even provide that: Where is VPERMB in AVX2? (Have to wait for AVX512VBMI).
(related: best way to shuffle across AVX lanes? Also, AVX512 adds a lot of flexible lane-crossing shuffles, but the AVX512 versions of SSE/AVX instructions like vhaddps zmm
are still in-lane. See also Do 128bit cross lane operations in AVX512 give better performance?)
A chain of AVX2 vpack*
typically needs a vpermq
to do a lane-crossing fixup at the end, unless you're going to unpack in-lane again. So in most cases, 2x in-lane shuffles are worse than a full 256-bit wide operation, but that's not what we get from AVX. There's often still a speedup from going up from 128-bit to 256-bit vectors even when extra shuffles are needed to correct for the in-lane behaviour, but it's then often less than a 2x speedup even with no memory bottlenecks.
vpalignr
is probably the most egregious example of 2x 128-bit versions of the same shuffle not being a useful building block on its own; I can't remember ever seeing a use-case for taking 2 separate in-lane byte windows of data. Oh, actually yes, if you feed it with vperm2i128
How to concatenate two vector efficiently using AVX2? (a lane-crossing version of VPALIGNR) but usually unaligned loads are better on CPUs that support AVX2.
The use-cases for (v)haddps
are very limited. Maybe Intel planned to make haddps
into a single-uop instruction at some point after introducing it with SSE3, but that never happened.
The use-cases include transpose-and-add type things where you'd need to shuffle both inputs for a vertical addps
anyway. e.g. Most efficient way to get a __m256 of horizontal sums of 8 source __m256 vectors includes vhaddps
. (Plus AVX1 vperm2f128
to correct for the in-lane behaviour.)
Many people mistakenly think it's good for horizontal sums of a single vector, but both 128 and 256-bit (v)haddps
decode to 2x shuffle uops to prepare input vectors for a vertical (v)addps
uop. For a horizontal sum you only need 1 shuffle uop per add. (Fastest way to do horizontal float vector sum on x86)
Narrowing to 128-bit first (with vextractf128
/ vaddps
) is usually a better first step unless you want the result broadcast to every element, and you're not on an AMD CPU (where 256-bit vector operations decode to at least 2 uops, or more for lane-crossing shuffles). (v)haddps xmm
or integer vphaddd
are useful for horizontal sums if you're optimizing for code-size not speed, e.g. my x86 machine-code answer on the code-golf question "Calculate the Mean mean of two numbers".
AVX's non-destructive destination operands also remove some of the appeal of having a multi-uop instruction. Without AVX, sometimes you can't avoid a movaps
to copy a register before destroying it, so baking 2x shuffle + add into 1 instruction did actually save uops vs. having to do that manually with movaps
+ shufps
.
As with many 256-bit wide instructions, the upper 128 bits
of vhaddps ymm, ymm, ymm
are just a copy-paste of the 128-bit wide vhaddps xmm, xmm, xmm
instruction. The following example shows why it makes sense to
define vhaddps xmm, xmm, xmm
in such an involved way: chaining this instruction in two steps
gives you the horizontal sums of 4 xmm
registers.
/* gcc -m64 -O3 hadd_ex.c -march=sandybridge */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float tmp[4];
    /* _mm_set_ps lists elements from high to low */
    __m128 a = _mm_set_ps(1.0, 2.0, 3.0, 4.0);
    __m128 b = _mm_set_ps(10.0, 20.0, 30.0, 40.0);
    __m128 c = _mm_set_ps(100.0, 200.0, 300.0, 400.0);
    __m128 d = _mm_set_ps(1000.0, 2000.0, 3000.0, 4000.0);
    __m128 sum1 = _mm_hadd_ps(a, b);       /* pairwise sums of a and b */
    __m128 sum2 = _mm_hadd_ps(c, d);       /* pairwise sums of c and d */
    __m128 sum  = _mm_hadd_ps(sum1, sum2); /* full sums of a, b, c, d  */
    _mm_storeu_ps(tmp, sum);
    printf("sum = %f %f %f %f\n", tmp[0], tmp[1], tmp[2], tmp[3]);
    return 0;
}
Output:
sum = 10.000000 100.000000 1000.000000 10000.000000