
The problem can be described as follows.

Input

__m256d a, b, c, d

Output

__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
             c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
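
For reference, here is the same computation written as plain scalar C (a sketch; the helper name is mine):

#include <immintrin.h>

// Scalar reference: element i of the result is the horizontal sum of the i-th input.
static __m256d hsum4_ref(__m256d a, __m256d b, __m256d c, __m256d d)
{
    double va[4], vb[4], vc[4], vd[4];
    _mm256_storeu_pd(va, a);
    _mm256_storeu_pd(vb, b);
    _mm256_storeu_pd(vc, c);
    _mm256_storeu_pd(vd, d);
    // _mm256_set_pd takes its arguments from the highest element down
    return _mm256_set_pd(vd[0] + vd[1] + vd[2] + vd[3],
                         vc[0] + vc[1] + vc[2] + vc[3],
                         vb[0] + vb[1] + vb[2] + vb[3],
                         va[0] + va[1] + va[2] + va[3]);
}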

Work I have done so far

It seemed easy enough: two VHADD instructions with some shuffling in between. But in fact, no combination of the permutations offered by AVX can produce the one permutation needed to achieve that goal. Let me explain:

VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}

Were I able to permute x and y in the same manner to get

x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}

then

VHADD s, x1, y1 => s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
                        c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

which is the result I wanted.

Thus I just need to find a way to perform

x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}

Unfortunately I came to the conclusion that this is provably impossible using any combination of VSHUFPD, VBLENDPD, VPERMILPD, VPERM2F128, VUNPCKHPD, VUNPCKLPD. The crux of the matter is that it is impossible to swap u[1] and u[2] in an instance u of __m256d.

Question

Is this really a dead end? Or have I missed a permutation instruction?

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • Possibly you could make the task easier by swapping `b` and `c` in the input. That way you would only have to swap the inner two doubles. In any case, this is achievable using a combination of `vperm2f128`, two spare registers, `vpermilpd` swapping the 1st/2nd and 3rd/4th, and a final `vunpck[hl]pd` combo to finish off the unwary source code reader. – Gunther Piez May 31 '12 at 12:59
  • Another idea (possibly even faster) would be a transposition of the input matrix (mirror along the diagonal) and do a vertical sum. That is, transform to `{a0, b0, c0, d0}` etc. and simply add up. – Gunther Piez May 31 '12 at 13:01
  • I thought of swapping b and c too. As this code may be reused by others than myself, I wished I could find a solution devoid of such twist! As for your second suggestion, I can't reorganise the code to produce the transpose matrix in the first place and I can't see an efficient way to perform the transpose, so dead end I'd say. –  May 31 '12 at 16:46
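
For reference, the transpose-then-vertical-sum idea from the comments could look like this (a sketch with my own variable names; 8 shuffles plus 3 adds, so more instructions than the 5-instruction answer below):

// 4x4 transpose of {a, b, c, d} using in-lane unpacks + 128-bit permutes
__m256d t0 = _mm256_unpacklo_pd(a, b);             // {a0, b0, a2, b2}
__m256d t1 = _mm256_unpackhi_pd(a, b);             // {a1, b1, a3, b3}
__m256d t2 = _mm256_unpacklo_pd(c, d);             // {c0, d0, c2, d2}
__m256d t3 = _mm256_unpackhi_pd(c, d);             // {c1, d1, c3, d3}
__m256d r0 = _mm256_permute2f128_pd(t0, t2, 0x20); // {a0, b0, c0, d0}
__m256d r1 = _mm256_permute2f128_pd(t1, t3, 0x20); // {a1, b1, c1, d1}
__m256d r2 = _mm256_permute2f128_pd(t0, t2, 0x31); // {a2, b2, c2, d2}
__m256d r3 = _mm256_permute2f128_pd(t1, t3, 0x31); // {a3, b3, c3, d3}
// vertical sums: {sum(a), sum(b), sum(c), sum(d)}
__m256d s  = _mm256_add_pd(_mm256_add_pd(r0, r1), _mm256_add_pd(r2, r3));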

2 Answers


VHADD instructions are meant to be followed by regular VADD. The following code should give you what you want:

#include <immintrin.h>  // AVX intrinsics

// {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
__m256d sumab = _mm256_hadd_pd(a, b);
// {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
__m256d sumcd = _mm256_hadd_pd(c, d);

// {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
__m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
// {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
__m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

// {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
__m256d sum = _mm256_add_pd(perm, blend);

This gives the result in 5 instructions. I hope I got the constants right.
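
A minimal harness to sanity-check those constants (my own scaffolding, not part of the original answer; compile with e.g. gcc -mavx):

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    // _mm256_set_pd takes its arguments from the highest element down
    __m256d a = _mm256_set_pd( 4.0,  3.0,  2.0,  1.0); // sums to 10
    __m256d b = _mm256_set_pd( 8.0,  7.0,  6.0,  5.0); // sums to 26
    __m256d c = _mm256_set_pd(12.0, 11.0, 10.0,  9.0); // sums to 42
    __m256d d = _mm256_set_pd(16.0, 15.0, 14.0, 13.0); // sums to 58

    __m256d sumab = _mm256_hadd_pd(a, b);
    __m256d sumcd = _mm256_hadd_pd(c, d);
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
    __m256d perm  = _mm256_permute2f128_pd(sumab, sumcd, 0x21);
    __m256d sum   = _mm256_add_pd(perm, blend);

    double out[4];
    _mm256_storeu_pd(out, sum);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); // expect: 10 26 42 58
    return 0;
}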

The permutation that you proposed is certainly possible to accomplish, but it takes multiple instructions. Sorry that I'm not answering that part of your question.

Edit: I couldn't resist, here's the complete permutation. (Again, I did my best to get the constants right.) You can see that swapping u[1] and u[2] is possible; it just takes a bit of work. Crossing the 128-bit barrier is difficult in first-generation AVX. I also want to say that VADD is preferable to VHADD, because VADD has twice the throughput even though it does the same number of additions.

// inputs: {x[0],x[1],x[2],x[3]} and {y[0],y[1],y[2],y[3]}
__m256d x, y;

// {x[1],x[0],x[3],x[2]}
__m256d xswap = _mm256_permute_pd(x, 0b0101);

// {x[3],x[2],x[1],x[0]}
__m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);

// {x[0],x[2],x[1],x[3]} -- so it is not impossible to swap x[1] and x[2]
__m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);

// repeat the same for y
__m256d yswap = _mm256_permute_pd(y, 0b0101);
__m256d yflip128 = _mm256_permute2f128_pd(yswap, yswap, 0x01);
// {y[0],y[2],y[1],y[3]}
__m256d yblend = _mm256_blend_pd(y, yflip128, 0b0110);

// {x[0],x[2],y[0],y[2]}
__m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);

// {x[1],x[3],y[1],y[3]}
__m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);
Norbert P.
  • Clever! Thanks! No, I assure you that the permutation I proposed can't be done but your scheme makes it superfluous. And with 5 instructions, you have given the most efficient solution. –  May 31 '12 at 16:47
  • @ljbou: one can perform an _arbitrary_ permutation of __m256d in at most 4 instructions in the current generation of AVX: one `VPERM2F128`, two `VSHUFPD`, and one `VBLENDPD`. Incoming [AVX2 in Haswell](http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/) is much more capable, allowing an arbitrary permutation in one instruction (`VPERMPD` I think). – Norbert P. May 31 '12 at 23:52
  • This is genius, been struggling with the horizontal add for a few hours. Do you know any library that implements these commonly used operations in the least instructions/cycles possible? – jmiserez Jun 14 '14 at 21:51
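
To illustrate the AVX2 point from the comment above: _mm256_permute4x64_pd (VPERMPD) performs the u[1]/u[2] swap that the question describes as impossible in a single instruction (a sketch; requires AVX2, which postdates this question):

// immediate 0xD8 = 0b11011000 selects source elements {0, 2, 1, 3}
__m256d swapped = _mm256_permute4x64_pd(x, 0xD8); // {x[0], x[2], x[1], x[3]}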

I'm not aware of any instruction that lets you do that sort of permutation. AVX instructions typically operate such that the upper and lower 128 bits of the register are somewhat independent; there isn't much capability for intermingling values from the two halves. The best implementation I can think of would be based on the answer to this question:

#include <immintrin.h>  // AVX intrinsics

__m128d horizontal_add_pd(__m256d x1, __m256d x2)
{
    // calculate 4 two-element horizontal sums:
    // lower 64 bits contain x1[0] + x1[1]
    // next 64 bits contain x2[0] + x2[1]
    // next 64 bits contain x1[2] + x1[3]
    // next 64 bits contain x2[2] + x2[3]
    __m256d sum = _mm256_hadd_pd(x1, x2);
    // extract upper 128 bits of result
    __m128d sum_high = _mm256_extractf128_pd(sum, 1);
    // add upper 128 bits of sum to its lower 128 bits
    __m128d result = _mm_add_pd(sum_high, _mm256_castpd256_pd128(sum));
    // lower 64 bits of result contain the sum of x1[0], x1[1], x1[2], x1[3]
    // upper 64 bits of result contain the sum of x2[0], x2[1], x2[2], x2[3]
    return result;
}

__m256d a, b, c, d;
__m128d res1 = horizontal_add_pd(a, b);
__m128d res2 = horizontal_add_pd(c, d);
// At this point:
//     res1 contains a's horizontal sum in bits 0-63
//     res1 contains b's horizontal sum in bits 64-127
//     res2 contains c's horizontal sum in bits 0-63
//     res2 contains d's horizontal sum in bits 64-127
// cast res1 to a __m256d, then insert res2 into the upper 128 bits of the result
__m256d sum = _mm256_insertf128_pd(_mm256_castpd128_pd256(res1), res2, 1);
// At this point:
//     sum contains a's horizontal sum in bits 0-63
//     sum contains b's horizontal sum in bits 64-127
//     sum contains c's horizontal sum in bits 128-191
//     sum contains d's horizontal sum in bits 192-255

This should be what you want. It should be doable in 7 instructions total (the cast shouldn't really generate any code; it's just a note to the compiler to change the way it treats the value in res1), assuming that the short horizontal_add_pd() function can be inlined by your compiler and you have enough registers available.

Jason R
  • as drhirsch pointed out, you can see this as the equivalent of 4 vertical sums plus some overhead, which amounts to 3/4 with your solution whereas Norbert's one gets down to 1/4. –  May 31 '12 at 17:16
  • hadd decodes to 3 uops (2x shuffle feeding a vertical add) on all Intel and AMD CPUs that support it. It's not worth using with both inputs the same, unless you're optimizing for code-size over speed. Extract the high half and add to the low half, from 256 down to 128, then 128 down to 64 (with `vunpcklpd` for example). – Peter Cordes Jan 29 '19 at 19:42
  • See [Get sum of values stored in \_\_m256d with SSE/AVX](//stackoverflow.com/q/49941645) for a way that compiles to efficient asm. (But neither of these even answer this question, which isn't about hsum of one register, it's about a sum+transpose, which *is* one of the few use-cases for `_mm256_hadd_pd`.) – Peter Cordes Jan 29 '19 at 19:45
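
For completeness, the single-register reduction pattern described in the comment above might look like this (a sketch with my own names):

#include <immintrin.h>

// Horizontal sum of one __m256d: reduce 256 -> 128 -> 64 bits.
static double hsum256_pd(__m256d v)
{
    __m128d lo   = _mm256_castpd256_pd128(v);   // low half (no instruction emitted)
    __m128d hi   = _mm256_extractf128_pd(v, 1); // high half
    lo = _mm_add_pd(lo, hi);                    // {v0+v2, v1+v3}
    __m128d swap = _mm_unpackhi_pd(lo, lo);     // {v1+v3, v1+v3}
    return _mm_cvtsd_f64(_mm_add_sd(lo, swap)); // v0+v1+v2+v3
}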