When you only want a scalar result, normally you want to narrow in half until you're down to 1 element, starting with _mm256_extractf128_ps / _mm256_castps256_ps128 so the rest of your operations are 128-bit instead of 256-bit, which makes them faster on Zen 1 and on the E-cores in Alder Lake and later. This is discussed in Fastest way to do horizontal SSE vector sum (or other reduction) and How to sum __m256 horizontally? for an efficient hsum to a scalar float, ending with _mm_cvtss_f32.
See Get sum of values stored in __m256d with SSE/AVX for discussion of 128-bit vs. 256-bit operations on different CPUs including Zen 1.
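For reference, a minimal sketch of that narrowing pattern applied to a horizontal max (my example, analogous to the hsums in the linked Q&As), ending with _mm_cvtss_f32:

#include <immintrin.h>

float hmax256_to_scalar(__m256 x)
{
    __m128 lo = _mm256_castps256_ps128(x);
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 m = _mm_max_ps(lo, hi);                                   // 8 -> 4
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));                          // 4 -> 2
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1,1, 1,1)));  // 2 -> 1
    return _mm_cvtss_f32(m);
}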
But then it takes extra instructions to broadcast the scalar back to every element of a vector. vbroadcastss ymm, xmm was new in AVX2, so AVX1 either needs to store/reload for a memory-source vbroadcastss, or use two shuffles (in-lane and then vinsertf128). (The code in the question does a shuffle and a store/reload for a 128-bit broadcast instead of 32-bit.)
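A minimal sketch of that two-shuffle AVX1 broadcast (my example, assuming the value you want is already in element 0 of a __m128):

#include <immintrin.h>

__m256 bcast_low_avx1(__m128 v)
{
    __m128 b = _mm_permute_ps(v, _MM_SHUFFLE(0,0, 0,0)); // in-lane broadcast of element 0
    return _mm256_set_m128(b, b);                        // vinsertf128: duplicate into both halves
}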
There are 2 good options here:
Keep everything 256-bit the whole time, and shuffle/max such that every element gets the max. One pattern that's easy to verify is to swap instead of just bringing high elements down to low: e.g. swap pairs of floats, then swap 64-bit chunks, then swap 128-bit lanes.
(When the logic is "obviously" correct without having to follow different things happening to different elements to check that every result element "sees" every input element, it's easier to read and maintain the code. The attempts in some other answers here are actually buggy, I think. Testing can help with that (see the sketch after these options): e.g. have the unit test make an array like float x[] = {0, 1, 0, 0,...} with the max in different positions.)
Or do it in the reverse order, starting with vperm2f128 (_mm256_permute2f128_ps); since you're doing both min and max, you can use the result of that one shuffle for both. Lane-crossing shuffles are more expensive on some CPUs, especially vperm2f128 on Zen 1, so doing this shuffle only once has advantages. (Otherwise I'd recommend doing the lower-latency in-lane shuffles first, so more of that shuffle/max dependency chain can get out of the scheduler and ROB sooner, giving out-of-order exec an easier time.)
Or, reduce to 128-bit, then shuffle/max so every element of a __m128 ends up the same, before widening again to 256-bit. (If you had AVX2 for scalar-to-ymm broadcast, you could do the reduction all the way down to scalar.) In your case, you can do one sub at 128-bit width, but the other operations involve or depend on the original x where all 8 elements are different.
This strategy is probably good for Zen 1 and maybe Intel E-cores, but is usually worse for CPUs with 256-bit vector execution units (like Zen 2 and later, and Intel big cores), because on those CPUs 256-bit ops generally cost the same as their 128-bit equivalents. (Using a bit more power, so max turbo could be reduced.)
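As mentioned earlier, a unit test can verify that every result element "sees" every input element. A minimal sketch (my example, using a hypothetical hmax256_broadcast helper that implements the swap/max pattern described above): put the max in each of the 8 positions in turn and check that it shows up in every output element.

#include <immintrin.h>
#include <assert.h>

static __m256 hmax256_broadcast(__m256 x)
{
    __m256 swapped = _mm256_permute2f128_ps(x, x, 1);  // swap 128-bit lanes
    __m256 m = _mm256_max_ps(x, swapped);
    m = _mm256_max_ps(m, _mm256_permute_ps(m, _MM_SHUFFLE(2,3, 0,1))); // swap pairs
    m = _mm256_max_ps(m, _mm256_permute_ps(m, _MM_SHUFFLE(1,0, 3,2))); // swap 64-bit halves
    return m;                                          // max in all 8 elements
}

int main(void)
{
    for (int pos = 0; pos < 8; pos++) {
        float in[8] = {0}, out[8];
        in[pos] = 1.0f;                    // the max, in a different position each time
        _mm256_storeu_ps(out, hmax256_broadcast(_mm256_loadu_ps(in)));
        for (int i = 0; i < 8; i++)
            assert(out[i] == 1.0f);        // every element must "see" the max
    }
    return 0;
}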
If you're doing a lot of this (for many 8-element arrays), vdivps throughput could be a factor, especially on the oldest CPUs that can run this code (with AVX1 but not AVX2, like Sandybridge, where 256-bit vdivps ymm throughput is one per 14 cycles vs. a 7-cycle throughput cost on the div/sqrt unit for vdivps xmm or vdivss scalar, and, unlike Skylake, higher latency for 256-bit div). See another Q&A about div throughput / latency on various microarchitectures and how it's so much worse than other operations.
So you might consider doing 1/(max-min) at 128-bit and broadcasting that for use with (x-min) * recip_scale. It would be more worthwhile if your arrays were larger (reusing the scale factor for multiple vectors of elements), although then you wouldn't have min/max of the same x; you'd have separate accumulators of mins and maxes to reduce independently. (In that case you might do the lane-crossing shuffle first for one, last for the other, to reduce contention for the min/max execution ports.)
You could even re-arrange to arr*(recip_scale) - (min*recip_scale) so you'd only need one FMA for each vector of x, as in the sketch below. Probably not worth it for only one vector, since it would take an extra operation to compute (min*recip_scale). But there are AVX1 CPUs without FMA. (And one obscure Via CPU with AVX2 but not FMA; but if you were making another version of this function, use -march=x86-64-v3 for AVX2+FMA+BMI1/2.)
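A minimal sketch of that rearrangement (my example, with hypothetical names; assumes an FMA-capable build, and that min and max have already been reduced and broadcast to every element of 256-bit vectors):

#include <immintrin.h>

__m256 normalise_fma(__m256 x, __m256 min_bcast, __m256 max_bcast)
{
    __m256 recip_scale = _mm256_div_ps(_mm256_set1_ps(1.0f),
                                       _mm256_sub_ps(max_bcast, min_bcast));
    __m256 min_scaled  = _mm256_mul_ps(min_bcast, recip_scale); // hoist these out of a loop over many vectors
    // x*recip_scale - min*recip_scale == (x - min) * recip_scale: one FMA per vector of x
    return _mm256_fmsub_ps(x, recip_scale, min_scaled);
}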
vrcpps can be used for fast approximate reciprocals, but only has about 12-bit precision without a Newton-Raphson iteration. On modern CPUs like Skylake (5 cycle throughput for 256-bit vdivps), the extra uops for that might be more of a bottleneck than just a div as part of the mix of many other operations. On older CPUs like Sandybridge, it might be worthwhile if you don't need full precision. (If you're not making a separate AVX2+FMA version that more modern CPUs will use, you should probably tune this version with modern CPUs in mind since they'll run it, too. Otherwise, you should only care about CPUs that have AVX1 but not AVX2: Sandy/Ivy Bridge, Jaguar, and most of Bulldozer-family.)
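A minimal sketch of that approximation (my example; AVX1-only, no FMA): one Newton-Raphson step refines the ~12-bit vrcpps estimate to nearly full single precision.

#include <immintrin.h>

__m256 approx_recip(__m256 range)
{
    __m256 r = _mm256_rcp_ps(range);             // ~12-bit approximation of 1/range
    // One Newton-Raphson step: r = r * (2 - range*r)
    __m256 two = _mm256_set1_ps(2.0f);
    return _mm256_mul_ps(r, _mm256_sub_ps(two, _mm256_mul_ps(range, r)));
}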
Narrow to 128-bit and re-widen
This does as much work as possible at 128-bit, including the max-min subtraction. I put the min computation first to encourage the compiler to put those instructions first, so that part of the critical path can get a head start, letting the x - min subtraction happen while the max and max-min results are still being computed and broadcast. Compilers schedule instructions themselves, so they might do something different. And of course OoO exec will interleave much of the work, but oldest-ready-first scheduling will prioritize whichever instructions were visible first.
_MM_SHUFFLE from immintrin.h is a macro that makes it easy to write 8-bit shuffle-control constants with 4x 2-bit index fields. The highest position is on the left, so _MM_SHUFFLE(3,2,1,0) == 0b11'10'01'00 is the identity shuffle.
__m256 normalise_128(__m256 x)
{
    __m128 xlow = _mm256_castps256_ps128(x);
    __m128 xhigh = _mm256_extractf128_ps(x, 1);   // reused by both min and max

    __m128 min128 = _mm_min_ps(xlow, xhigh);
    __m128 shuf = _mm_permute_ps(min128, _MM_SHUFFLE(2,3, 0,1));  // swap pairs
    min128 = _mm_min_ps(min128, shuf);
    shuf = _mm_permute_ps(min128, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves
    min128 = _mm_min_ps(min128, shuf);

    __m128 max128 = _mm_max_ps(xlow, xhigh);
    shuf = _mm_permute_ps(max128, _MM_SHUFFLE(2,3, 0,1));         // swap pairs
    max128 = _mm_max_ps(max128, shuf);
    shuf = _mm_permute_ps(max128, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves
    max128 = _mm_max_ps(max128, shuf);            // all 4 elements hold the max

    __m256 min = _mm256_set_m128(min128, min128); // vinsertf128
    __m128 range128 = _mm_sub_ps(max128, min128); // this subtraction can be done before widening
    __m256 range = _mm256_set_m128(range128, range128);
    return _mm256_div_ps(_mm256_sub_ps(x, min), range);
}
256-bit all the way
Notice that both versions use the same _MM_SHUFFLE constants for shuffling within 128-bit lanes. This is not an accident; both ways, we want to end up with the min or max in all elements.
__m256 normalisation_all256(__m256 x)
{
    __m256 xswapped = _mm256_permute2f128_ps(x, x, 1);  // swap 128-bit halves
    __m256 min = _mm256_min_ps(x, xswapped);            // low and high lanes are now the same

    __m256 shuf = _mm256_permute_ps(min, _MM_SHUFFLE(2,3, 0,1));  // swap pairs
    min = _mm256_min_ps(min, shuf);
    shuf = _mm256_permute_ps(min, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves within lanes
    min = _mm256_min_ps(min, shuf);                     // all 8 elements have seen every other

    __m256 max = _mm256_max_ps(x, xswapped);
    shuf = _mm256_permute_ps(max, _MM_SHUFFLE(2,3, 0,1));         // swap pairs
    max = _mm256_max_ps(max, shuf);
    shuf = _mm256_permute_ps(max, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves
    max = _mm256_max_ps(max, shuf);                     // all 8 elements hold the max

    __m256 range = _mm256_sub_ps(max, min);
    return _mm256_div_ps(_mm256_sub_ps(x, min), range);
}
These versions both compile to sensible asm on Godbolt. vperm2f128 is quite slow on Zen 1 (8 uops, 3c throughput), while vinsertf128/vextractf128 are very efficient there, so Zen 1 will certainly be faster with normalise_128, not doing redundant work in each half of an __m256.
Micro-optimization: vshufps instead of vpermilps
vpermilps is the "obvious" shuffle for your use-case, but it's not the most efficient. On Ice Lake and Alder Lake P-cores, it can only run on execution port 5 (which has a shuffle unit that can handle every shuffle).
vshufps, the AVX version of the old SSE1 instruction, has 2 input operands, but with the same input twice it can do the same shuffle. Intel big cores (P-cores) since Ice Lake can run it on port 1 or port 5, so using it reduces the bottleneck on the shuffle unit in port 5. (min/max can run on port 0 or 1.) vpermilps would be better on KNL Xeon Phi, where 2-input shuffles are slower, but not on other CPUs. They're equal on Skylake and on AMD Zen. (https://uops.info/)
Unfortunately clang doesn't optimize _mm_permute_ps into vshufps even with -mtune=icelake-client. In fact, clang (and GCC 12 and later) do the reverse, pessimizing _mm_shuffle_ps(same,same, i8) into vpermilps.
So for compilers like GCC 11 and earlier, and MSVC, it would be better to write the source with _mm_shuffle_ps(min,min, _MM_SHUFFLE(2,3, 0,1)). (Godbolt showing the asm difference)
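For example, a sketch of the 128-bit max reduction from normalise_128 written with _mm_shuffle_ps(same,same, imm) instead of _mm_permute_ps; the result is identical, it just steers those compilers toward vshufps:

#include <immintrin.h>

__m128 hmax128_shufps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,3, 0,1)); // swap pairs
    v = _mm_max_ps(v, shuf);
    shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,0, 3,2));        // swap 64-bit halves
    v = _mm_max_ps(v, shuf);
    return v;                                                  // max in all 4 elements
}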
Hopefully later GCC and clang will sort out their tuning rules and learn that vpermilps has worse throughput, and use vshufps unless there's some reason to avoid letting shuffles schedule on port 1, where they could compete with FP math / compare operations like min/max.
If you're only doing one of these normalization ops mixed in with other surrounding code, the extra pressure on port 5 might or might not be relevant; it might even be a good thing if later instructions have lots of non-port-5 work (like FP math) that's independent of this work and could overlap, as long as the shuffles don't steal cycles from it.