When you only want a scalar result, normally you want to narrow in half until you're down to 1 element, starting with _mm256_extractf128_ps / _mm256_castps256_ps128 so the rest of your operations are 128-bit instead of 256-bit, which makes them faster on Zen 1 and on the E-cores in Alder Lake and later. This is discussed in Fastest way to do horizontal SSE vector sum (or other reduction) and How to sum __m256 horizontally? for an efficient hsum to a scalar float, ending with _mm_cvtss_f32.
See Get sum of values stored in __m256d with SSE/AVX for discussion of 128-bit vs. 256-bit operations on different CPUs including Zen 1.
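For reference, a minimal sketch of that narrowing pattern applied to a horizontal max (my example, analogous to the hsums in the linked Q&As), ending with _mm_cvtss_f32:

#include <immintrin.h>

float hmax256_to_scalar(__m256 x)
{
    __m128 lo = _mm256_castps256_ps128(x);
    __m128 hi = _mm256_extractf128_ps(x, 1);
    __m128 m = _mm_max_ps(lo, hi);                                   // 8 -> 4
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));                          // 4 -> 2
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1,1, 1,1)));  // 2 -> 1
    return _mm_cvtss_f32(m);
}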
But then it takes extra instructions to broadcast the scalar back to every element of a vector. vbroadcastss ymm, xmm was new in AVX2, so AVX1 either needs to store/reload for a memory-source vbroadcastss, or use two shuffles (in-lane and then vinsertf128). (The code in the question does a shuffle and a store/reload for a 128-bit broadcast instead of 32-bit.)
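A minimal sketch of that two-shuffle AVX1 broadcast (my example, assuming the value you want is already in element 0 of a __m128):

#include <immintrin.h>

__m256 bcast_low_avx1(__m128 v)
{
    __m128 b = _mm_permute_ps(v, _MM_SHUFFLE(0,0, 0,0)); // in-lane broadcast of element 0
    return _mm256_set_m128(b, b);                        // vinsertf128: duplicate into both halves
}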
There are 2 good options here:
Keep everything 256-bit the whole time, and shuffle/max such that every element gets the max. One pattern that's easy to verify is to swap instead of just bringing high elements down to low: e.g. swap pairs of floats, then swap 64-bit chunks, then swap 128-bit lanes.
(When the logic is "obviously" correct without having to follow different things happening to different elements to check that every result element "sees" every input element, it's easier to read and maintain the code. The attempts in some other answers here are actually buggy, I think. Testing can help with that (see the sketch after these options): e.g. have the unit test make an array like float x[] = {0, 1, 0, 0,...} with the max in different positions.)
Or do it in the reverse order, starting with vperm2f128 (_mm256_permute2f128_ps); since you're doing both min and max, you can use the result of that one shuffle for both. Lane-crossing shuffles are more expensive on some CPUs, especially vperm2f128 on Zen 1, so doing this shuffle only once has advantages. (Otherwise I'd recommend doing the lower-latency in-lane shuffles first, so more of that shuffle/max dependency chain can get out of the scheduler and ROB sooner, giving out-of-order exec an easier time.)
Or, reduce to 128-bit, then shuffle/max so every element of a __m128 ends up the same, before widening again to 256-bit. (If you had AVX2 for scalar-to-ymm broadcast, you could do the reduction all the way down to scalar.) In your case, you can do one sub at 128-bit width, but the other operations involve or depend on the original x where all 8 elements are different.
This strategy is probably good for Zen 1 and maybe Intel E-cores, but is usually worse for CPUs with 256-bit vector execution units (like Zen 2 and later, and Intel big cores), because on those CPUs 256-bit ops generally cost the same as their 128-bit equivalents. (Using a bit more power, so max turbo could be reduced.)
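As mentioned earlier, a unit test can verify that every result element "sees" every input element. A minimal sketch (my example, using a hypothetical hmax256_broadcast helper that implements the swap/max pattern described above): put the max in each of the 8 positions in turn and check that it shows up in every output element.

#include <immintrin.h>
#include <assert.h>

static __m256 hmax256_broadcast(__m256 x)
{
    __m256 swapped = _mm256_permute2f128_ps(x, x, 1);  // swap 128-bit lanes
    __m256 m = _mm256_max_ps(x, swapped);
    m = _mm256_max_ps(m, _mm256_permute_ps(m, _MM_SHUFFLE(2,3, 0,1))); // swap pairs
    m = _mm256_max_ps(m, _mm256_permute_ps(m, _MM_SHUFFLE(1,0, 3,2))); // swap 64-bit halves
    return m;                                          // max in all 8 elements
}

int main(void)
{
    for (int pos = 0; pos < 8; pos++) {
        float in[8] = {0}, out[8];
        in[pos] = 1.0f;                    // the max, in a different position each time
        _mm256_storeu_ps(out, hmax256_broadcast(_mm256_loadu_ps(in)));
        for (int i = 0; i < 8; i++)
            assert(out[i] == 1.0f);        // every element must "see" the max
    }
    return 0;
}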
If you're doing a lot of this (for many 8-element arrays), vdivps throughput could be a factor, especially on the oldest CPUs that can run this code (with AVX1 but not AVX2, like Sandybridge, where 256-bit vdivps ymm throughput is one per 14 cycles vs. a 7-cycle throughput cost on the div/sqrt unit for vdivps xmm or vdivss scalar, and, unlike Skylake, higher latency for 256-bit div). See another Q&A about div throughput / latency on various microarchitectures and how it's so much worse than other operations.
So you might consider doing 1/(max-min) at 128-bit and broadcasting that for use with (x-min) * recip_scale. It would be more worthwhile if your arrays were larger (reusing the scale factor for multiple vectors of elements), although then you wouldn't have min/max of the same x; you'd have separate accumulators of mins and maxes to reduce independently. (In that case you might do the lane-crossing shuffle first for one, last for the other, to reduce contention for the min/max execution ports.)
You could even re-arrange to arr*(recip_scale) - (min*recip_scale) so you'd only need one FMA for each vector of x, as in the sketch below. Probably not worth it for only one vector, since it would take an extra operation to compute (min*recip_scale). But there are AVX1 CPUs without FMA. (And one obscure Via CPU with AVX2 but not FMA; but if you were making another version of this function, use -march=x86-64-v3 for AVX2+FMA+BMI1/2.)
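A minimal sketch of that rearrangement (my example, with hypothetical names; assumes an FMA-capable build, and that min and max have already been reduced and broadcast to every element of 256-bit vectors):

#include <immintrin.h>

__m256 normalise_fma(__m256 x, __m256 min_bcast, __m256 max_bcast)
{
    __m256 recip_scale = _mm256_div_ps(_mm256_set1_ps(1.0f),
                                       _mm256_sub_ps(max_bcast, min_bcast));
    __m256 min_scaled  = _mm256_mul_ps(min_bcast, recip_scale); // hoist these out of a loop over many vectors
    // x*recip_scale - min*recip_scale == (x - min) * recip_scale: one FMA per vector of x
    return _mm256_fmsub_ps(x, recip_scale, min_scaled);
}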
vrcpps can be used for fast approximate reciprocals, but only has about 12-bit precision without a Newton-Raphson iteration. On modern CPUs like Skylake (5 cycle throughput for 256-bit vdivps), the extra uops for that might be more of a bottleneck than just a div as part of the mix of many other operations. On older CPUs like Sandybridge, it might be worthwhile if you don't need full precision. (If you're not making a separate AVX2+FMA version that more modern CPUs will use, you should probably tune this version with modern CPUs in mind since they'll run it, too. Otherwise, you should only care about CPUs that have AVX1 but not AVX2: Sandy/Ivy Bridge, Jaguar, and most of Bulldozer-family.)
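A minimal sketch of that approximation (my example; AVX1-only, no FMA): one Newton-Raphson step refines the ~12-bit vrcpps estimate to nearly full single precision.

#include <immintrin.h>

__m256 approx_recip(__m256 range)
{
    __m256 r = _mm256_rcp_ps(range);             // ~12-bit approximation of 1/range
    // One Newton-Raphson step: r = r * (2 - range*r)
    __m256 two = _mm256_set1_ps(2.0f);
    return _mm256_mul_ps(r, _mm256_sub_ps(two, _mm256_mul_ps(range, r)));
}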
Narrow to 128-bit and re-widen
This does as much work as possible at 128-bit, including the max-min subtraction. I put the min computation first to encourage the compiler to put those instructions first, so that part of the critical path can get a head start, letting the x - min subtraction happen while the max and max-min results are still being computed and broadcast. Compilers schedule instructions themselves, so they might do something different. And of course OoO exec will interleave much of the work, but oldest-ready-first scheduling will prioritize whichever instructions were visible first.
_MM_SHUFFLE from immintrin.h is a macro that makes it easy to write 8-bit shuffle-control constants with 4x 2-bit index fields. The highest position is on the left, so _MM_SHUFFLE(3,2,1,0) == 0b11'10'01'00 is the identity shuffle.
__m256 normalise_128(__m256 x)
{
    __m128 xlow = _mm256_castps256_ps128(x);
    __m128 xhigh = _mm256_extractf128_ps(x, 1);   // reused by both min and max

    __m128 min128 = _mm_min_ps(xlow, xhigh);
    __m128 shuf = _mm_permute_ps(min128, _MM_SHUFFLE(2,3, 0,1));  // swap pairs
    min128 = _mm_min_ps(min128, shuf);
    shuf = _mm_permute_ps(min128, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves
    min128 = _mm_min_ps(min128, shuf);

    __m128 max128 = _mm_max_ps(xlow, xhigh);
    shuf = _mm_permute_ps(max128, _MM_SHUFFLE(2,3, 0,1));         // swap pairs
    max128 = _mm_max_ps(max128, shuf);
    shuf = _mm_permute_ps(max128, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves
    max128 = _mm_max_ps(max128, shuf);            // all 4 elements hold the max

    __m256 min = _mm256_set_m128(min128, min128); // vinsertf128
    __m128 range128 = _mm_sub_ps(max128, min128); // this subtraction can be done before widening
    __m256 range = _mm256_set_m128(range128, range128);
    return _mm256_div_ps(_mm256_sub_ps(x, min), range);
}
256-bit all the way
Notice that both versions use the same _MM_SHUFFLE constants for shuffling within 128-bit lanes. This is not an accident; both ways, we want to end up with the min or max in all elements.
__m256 normalisation_all256(__m256 x)
{
    __m256 xswapped = _mm256_permute2f128_ps(x, x, 1);  // swap 128-bit halves
    __m256 min = _mm256_min_ps(x, xswapped);            // low and high lanes are now the same

    __m256 shuf = _mm256_permute_ps(min, _MM_SHUFFLE(2,3, 0,1));  // swap pairs
    min = _mm256_min_ps(min, shuf);
    shuf = _mm256_permute_ps(min, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves within lanes
    min = _mm256_min_ps(min, shuf);                     // all 8 elements have seen every other

    __m256 max = _mm256_max_ps(x, xswapped);
    shuf = _mm256_permute_ps(max, _MM_SHUFFLE(2,3, 0,1));         // swap pairs
    max = _mm256_max_ps(max, shuf);
    shuf = _mm256_permute_ps(max, _MM_SHUFFLE(1,0, 3,2));         // swap 64-bit halves
    max = _mm256_max_ps(max, shuf);                     // all 8 elements hold the max

    __m256 range = _mm256_sub_ps(max, min);
    return _mm256_div_ps(_mm256_sub_ps(x, min), range);
}
These versions both compile to sensible asm on Godbolt. vperm2f128 is quite slow on Zen 1 (8 uops, 3c throughput), while vinsertf128/vextractf128 are very efficient there, so Zen 1 will certainly be faster with normalise_128, not doing redundant work in each half of an __m256.
Micro-optimization: vshufps instead of vpermilps
vpermilps is the "obvious" shuffle for your use-case, but it's not the most efficient. On Ice Lake and Alder Lake P-cores, it can only run on execution port 5 (which has a shuffle unit that can handle every shuffle).
vshufps, the AVX version of the old SSE1 instruction, has 2 input operands, but with the same input twice it can do the same shuffle. Intel big cores (P-cores) since Ice Lake can run it on port 1 or port 5, so using it reduces the bottleneck on the shuffle unit in port 5. (min/max can run on port 0 or 1.) vpermilps would be better on KNL Xeon Phi, where 2-input shuffles are slower, but not on other CPUs. They're equal on Skylake and on AMD Zen. (https://uops.info/)
Unfortunately clang doesn't optimize _mm_permute_ps into vshufps even with -mtune=icelake-client. In fact, clang (and GCC 12 and later) do the reverse, pessimizing _mm_shuffle_ps(same,same, i8) into vpermilps.
So for compilers like GCC 11 and earlier, and MSVC, it would be better to write the source with _mm_shuffle_ps(min,min, _MM_SHUFFLE(2,3, 0,1)). (Godbolt showing the asm difference)
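For example, a sketch of the 128-bit max reduction from normalise_128 written with _mm_shuffle_ps(same,same, imm) instead of _mm_permute_ps; the result is identical, it just steers those compilers toward vshufps:

#include <immintrin.h>

__m128 hmax128_shufps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,3, 0,1)); // swap pairs
    v = _mm_max_ps(v, shuf);
    shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,0, 3,2));        // swap 64-bit halves
    v = _mm_max_ps(v, shuf);
    return v;                                                  // max in all 4 elements
}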
Hopefully later GCC and clang will sort out their tuning rules and learn that vpermilps has worse throughput, and use vshufps unless there's some reason to avoid letting shuffles schedule on port 1, where they could compete with FP math / compare operations like min/max.
If you're only doing one of these normalization ops mixed in with other surrounding code, the extra pressure on port 5 might or might not be relevant; it might even be a good thing if later instructions have lots of non-port-5 work (like FP math) that's independent of this work and could overlap, as long as the shuffles don't steal cycles from it.