
I know how to sum one __m256 to get a single summed value. However, I have 8 vectors.

Input

1: a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7],
.....,
.....,
8: h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7]

Output

a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7], 
 ...., 
h[0]+h[1]+h[2]+h[3]+h[4]+h[5]+h[6]+h[7]
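
In scalar terms, the output is just eight independent reductions; a plain-C reference of what I mean (hypothetical helper, only to pin down the ordering):

    // out[i] = horizontal sum of the i-th input vector
    void hsum8_ref(const float in[8][8], float out[8])
    {
        for (int i = 0; i < 8; ++i) {
            float s = 0.0f;
            for (int j = 0; j < 8; ++j)
                s += in[i][j];
            out[i] = s;
        }
    }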

My method is below; curious if there is a better way.

            __m256 sumab = _mm256_hadd_ps(accumulator1, accumulator2);
            __m256 sumcd = _mm256_hadd_ps(accumulator3, accumulator4);

            __m256 sumef = _mm256_hadd_ps(accumulator5, accumulator6);
            __m256 sumgh = _mm256_hadd_ps(accumulator7, accumulator8);

            __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);
            __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);

            __m128 sumabcd1 = _mm256_extractf128_ps(sumabcd, 0);
            __m128 sumabcd2 = _mm256_extractf128_ps(sumabcd, 1);
            __m128 sumefgh1 = _mm256_extractf128_ps(sumefgh, 0);
            __m128 sumefgh2 = _mm256_extractf128_ps(sumefgh, 1);

            sumabcd1 = _mm_add_ps(sumabcd1, sumabcd2);
            sumefgh1 = _mm_add_ps(sumefgh1, sumefgh2);

            __m256 result = _mm256_insertf128_ps(_mm256_castps128_ps256(sumabcd1), sumefgh1, 1);
Chase R Lewis
  • I don't think you can improve greatly on this, but if it's really performance-critical then note that `_mm256_hadd_ps` typically has latency of 5, while `_mm256_add_ps` has a latency of 3, so maybe prefer the latter when you have a choice, even if it adds a few instructions. Intel's IACA tool can be useful for comparing the relative efficiency of small code fragments like this. – Paul R Mar 24 '16 at 08:26
    So you need a vector of horizontal sums of 8 source vectors? You *could* transpose first and then do vertical sums, but that's probably a lot slower. Have a look at [my horizontal-sum answer](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026) for ideas. Doing some of the work with insns other than haddps is probably good. OTOH, you're taking full advantage of the merging power of `hadd` at every step, never using it with both operands the same. `extract(x,0)` is free, since it's just a cast. – Peter Cordes Mar 24 '16 at 09:32
    Maybe you can rethink your algorithm so that you end up only needing vertical operations – Z boson Mar 24 '16 at 17:16
    For matrix-matrix multiplication I don't think there is a way to do a purely vertical operation without converting the inner-loop multiplication from 8x8 to 1x8, and that inner loop scales as O(n^3) whereas the sum operation is in a section that only scales as O(n^2). I'm just trying to create a user-friendly multi-threaded C++11 matrix library; it doesn't have to be the fastest on a single core. I'm about 30% faster than Eigen using AVX2 and about 20% slower using SSE2 at the moment. To be frank, after looking at the Eigen source code, I think the difference is just cache-size optimization. – Chase R Lewis Mar 24 '16 at 23:05

1 Answer


Update: Computing 8 horizontal sums of eight AVX single-precision floating-point vectors is (I think) the same problem, solved with a blend replacing one of the _mm256_permute2f128_ps shuffles. There's also another answer that uses more blends to replace shuffle uops. Use one of those instead.
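
A sketch of that blend variant, keeping the same hadd stages as the function below (vblendps can run on ports other than the shuffle port, so it relieves shuffle-port pressure; the linked answers may differ in detail):

    #include <immintrin.h>

    __m256 hsum8_blend(__m256 a, __m256 b, __m256 c, __m256 d,
                       __m256 e, __m256 f, __m256 g, __m256 h)
    {
        __m256 sumab = _mm256_hadd_ps(a, b);
        __m256 sumcd = _mm256_hadd_ps(c, d);
        __m256 sumef = _mm256_hadd_ps(e, f);
        __m256 sumgh = _mm256_hadd_ps(g, h);

        __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);  // [ D7:4 ... A7:4 | D3:0 ... A3:0 ]
        __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);  // [ H7:4 ... E7:4 | H3:0 ... E3:0 ]

        // one lane-crossing shuffle: [ H3:0 ... E3:0 | D7:4 ... A7:4 ]
        __m256 swapped = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x21);
        // one blend instead of a second shuffle: [ H7:4 ... E7:4 | D3:0 ... A3:0 ]
        __m256 blended = _mm256_blend_ps(sumabcd, sumefgh, 0xF0);

        return _mm256_add_ps(swapped, blended);         // [ H ... E | D ... A ]
    }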


Original answer that fails to use any blends and will bottleneck on shuffles

You can use 2x _mm256_permute2f128_ps to line up the low and high lanes for a vertical vaddps. This is instead of 2x extractf128 / insertf128. This also turns two 128b vaddps xmm instructions into a single 256b vaddps ymm.

vperm2f128 is as fast as a single vextractf128 or vinsertf128 on Intel CPUs. It's slow on AMD, though (8 m-ops with 4c latency on Bulldozer-family). Still, not so bad that you need to avoid it, even if you care about perf on AMD. (And one of the permutes can actually be a vinsertf128).


__m256 hsum8(__m256 a, __m256 b, __m256 c, __m256 d,
             __m256 e, __m256 f, __m256 g, __m256 h)
{
    // a = [ A7 A6 A5 A4 | A3 A2 A1 A0 ]
    __m256 sumab = _mm256_hadd_ps(a, b);
    __m256 sumcd = _mm256_hadd_ps(c, d);

    __m256 sumef = _mm256_hadd_ps(e, f);
    __m256 sumgh = _mm256_hadd_ps(g, h);

    __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);  // [ D7:4 ... A7:4 | D3:0 ... A3:0 ]
    __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);  // [ H7:4 ... E7:4 | H3:0 ... E3:0 ]

    __m256 sum_hi = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31);  // [ H7:4 ... E7:4 | D7:4 ... A7:4 ]
    __m256 sum_lo = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20);  // [ H3:0 ... E3:0 | D3:0 ... A3:0 ]

    __m256 result = _mm256_add_ps(sum_hi, sum_lo);
    return result;
}

This compiles as you'd expect. The second permute2f128 actually compiles to a vinsertf128, since it only uses the low lane of each input, in the same way vinsertf128 does. gcc 4.7 and later do this optimization, but clang only does it in much more recent versions (v3.7). If you care about old clang, do it at the source level.
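
For old clang, the source-level version of that second permute would look something like this, i.e. replace the sum_lo line in hsum8 above with (a sketch using the same variables):

    // sum_lo without vperm2f128: keep the low lane of sumabcd, put the low lane of sumefgh on top
    __m256 sum_lo = _mm256_insertf128_ps(sumabcd, _mm256_castps256_ps128(sumefgh), 1);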

The savings in source lines are bigger than the savings in instructions, because _mm256_extractf128_ps(sumabcd, 0); compiles to zero instructions: it's just a cast. No compiler should ever emit vextractf128 with an imm8 other than 1 (vmovdqa xmm/m128, xmm is always better for getting the low lane). Nice job, Intel, wasting an instruction byte on future-proofing that you couldn't use, because plain VEX prefixes don't have room to encode longer vectors.

The two vaddps xmm instructions could run in parallel, so using a single vaddps ymm is mostly just a throughput (and code size) gain, not latency.

We do shave off 3 cycles by completely eliminating the final vinsertf128, though.


vhaddps is 3 uops, 5c latency, and one per 2c throughput. (6c latency on Skylake). Two of those three uops run on the shuffle port. I guess it's basically doing 2x shufps to generate operands for addps.
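
If that guess is right, the intrinsics equivalent would be something like this sketch (same uop count as vhaddps, so not worth emitting by hand; shown only to illustrate what the hardware is probably doing):

    // Emulate _mm256_hadd_ps(a, b): per 128-bit lane, gather the even-index
    // and odd-index elements of both inputs, then add them vertically.
    __m256 hadd_emul(__m256 a, __m256 b)
    {
        __m256 evens = _mm256_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));  // [ b2 b0 a2 a0 ] per lane
        __m256 odds  = _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));  // [ b3 b1 a3 a1 ] per lane
        return _mm256_add_ps(evens, odds);
    }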

If we can emulate haddps (or at least get a horizontal operation we can use) with a single shufps/addps or something, we'd come out ahead. Unfortunately, I don't see how. A single shuffle can only produce one result with data from two vectors, but we need both inputs to vertical addps to have data from both vectors.

I don't think doing the horizontal sum another way looks promising. Normally, hadd is not a good choice, because the common horizontal-sum use-case only cares about one element of its output. That's not the case here: every element of every hadd result is actually used.

Peter Cordes
  • Ah, thanks. Since starting AVX I still get confused by some of the permute and shuffle operations, since they aren't quite as straightforward as the old SSE ones. Any time I use 'extract' I know there is almost always a superior way with a shuffle, but I couldn't figure it out. Much appreciated. – Chase R Lewis Mar 24 '16 at 22:37
  • @user2927848: There are cases where insert/extract are useful. e.g. as a first step in a horizontal sum (or other reduction). AVX is hard because in-lane stuff is lower latency than cross-lane stuff. (And some stuff like cross-lane permutes (`vpermps`) requires AVX2). – Peter Cordes Mar 25 '16 at 00:01