If you're doing this in a loop with the same vector of counters for all FP compares, you can just use 32-bit counts (`_mm256_sub_epi32(counts32, _mm256_castps_si256(compare))`) for up to 2^32-1 iterations of an inner loop before needing to widen and add to the 64-bit counts. (Widen with zero-extension, which is easier than sign-extension for the high half.)
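Here's a minimal sketch of that nested-loop structure, assuming AVX2, a made-up greater-than-threshold predicate, and n being a multiple of 8 (names like count_gt / counts32 are purely illustrative, not from the question):

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Count elements of data[0..n) that compare greater than threshold.
// Assumes n is a multiple of 8; a real version needs scalar/masked cleanup.
uint64_t count_gt(const float *data, size_t n, float threshold)
{
    const __m256 thresh = _mm256_set1_ps(threshold);
    __m256i counts64 = _mm256_setzero_si256();          // 4x 64-bit totals
    size_t i = 0;
    while (i < n) {
        // Inner loop: 32-bit per-element counters, safe for up to 2^32-1 iterations.
        size_t iters = (n - i) / 8;
        if (iters > 0xFFFFFFFFu) iters = 0xFFFFFFFFu;
        size_t chunk_end = i + iters * 8;
        __m256i counts32 = _mm256_setzero_si256();
        for (; i < chunk_end; i += 8) {
            __m256 v   = _mm256_loadu_ps(data + i);
            __m256 cmp = _mm256_cmp_ps(v, thresh, _CMP_GT_OQ);  // all-ones where true
            counts32 = _mm256_sub_epi32(counts32, _mm256_castps_si256(cmp)); // x - (-1) == x + 1
        }
        // Widen 8x 32-bit counts to 64-bit (zero-extension) and add to the totals.
        __m256i lo = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(counts32));
        __m256i hi = _mm256_cvtepu32_epi64(_mm256_extracti128_si256(counts32, 1));
        counts64 = _mm256_add_epi64(counts64, _mm256_add_epi64(lo, hi));
    }
    // Horizontal sum of the 4x 64-bit totals, done with scalar code for simplicity.
    uint64_t tmp[4];
    _mm256_storeu_si256((__m256i *)tmp, counts64);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```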
The same idea of using an inner loop to defer widening is shown in full-code answers to How to count character occurrences using SIMD (where the inner loop can only run at most 255 times, and there's a special trick to horizontal-sum 8-bit to 64-bit, unlike in this case).
You might unroll with two accumulator vectors, since some CPUs can sustain two FP compares and two `vpsubd` integer accumulates per clock, so you'd avoid a latency bottleneck. (Alder Lake can do 3 loads per clock, enough to feed more than 1 FP compare per clock even if both operands come from memory: https://uops.info/.) Or not, if you expect your data usually won't be hot in L1d cache, so L2 or worse bandwidth will be the bottleneck, especially if you need two loads per FP compare. Or if you're comparing the results of a more expensive computation.
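For example, the inner loop of the sketch above could look like this with two accumulators (hypothetical helper; count must be a multiple of 16, and small enough that the combined 32-bit sums can't overflow):

```c
#include <immintrin.h>
#include <stddef.h>

// Inner loop unrolled with two accumulator vectors to hide vcmpps/vpsubd latency.
static __m256i inner_loop_unrolled(const float *data, size_t count, __m256 thresh)
{
    __m256i acc_a = _mm256_setzero_si256();
    __m256i acc_b = _mm256_setzero_si256();
    for (size_t i = 0; i < count; i += 16) {
        __m256 cmp_a = _mm256_cmp_ps(_mm256_loadu_ps(data + i),     thresh, _CMP_GT_OQ);
        __m256 cmp_b = _mm256_cmp_ps(_mm256_loadu_ps(data + i + 8), thresh, _CMP_GT_OQ);
        acc_a = _mm256_sub_epi32(acc_a, _mm256_castps_si256(cmp_a));
        acc_b = _mm256_sub_epi32(acc_b, _mm256_castps_si256(cmp_b));
    }
    return _mm256_add_epi32(acc_a, acc_b);   // combine once, before widening
}
```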
Amortizing the cost of widening makes it basically irrelevant except for overall code-size, and for very small arrays where it still has to happen once for only a couple vectors of FP compares. So we should aim for a small uop-cache footprint and/or not needing vector constants.
If you ultimately want to sum down to a single scalar, position doesn't matter, so we can add odd/even pairs like `_mm256_and_si256(v, _mm256_set1_epi64x(0x00000000FFFFFFFF))` and `_mm256_srli_epi64(v, 32)`, or do the low halves with a shift left then right by 32 to zero-extend into the containing 64-bit element. Probably a better way to avoid loading a vector constant is `_mm256_unpacklo_epi32` and `_mm256_unpackhi_epi32(v, _mm256_setzero_si256())` (these operate in-lane, doing the same shuffle within each 128-bit lane).
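As a sketch of that pair-summing step (counts32 is assumed to be the 32-bit accumulator from earlier; the pairings differ between the two variants, which is fine when only the total matters):

```c
#include <immintrin.h>

// Variant 1: in-lane unpack against zero, no vector constant needed.
static inline __m256i pair_sum_unpack(__m256i counts32)
{
    __m256i zero = _mm256_setzero_si256();
    __m256i even = _mm256_unpacklo_epi32(counts32, zero);  // zero-extends elements 0,1,4,5
    __m256i odd  = _mm256_unpackhi_epi32(counts32, zero);  // zero-extends elements 2,3,6,7
    return _mm256_add_epi64(even, odd);                    // 4x 64-bit partial sums
}

// Variant 2: AND + shift, needs a vector constant for the low-half mask.
static inline __m256i pair_sum_mask(__m256i counts32)
{
    __m256i even = _mm256_and_si256(counts32, _mm256_set1_epi64x(0x00000000FFFFFFFF));
    __m256i odd  = _mm256_srli_epi64(counts32, 32);
    return _mm256_add_epi64(even, odd);
}
```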
Or if you limit the inner loop to 2^31-1 or 2^30 iterations so adding pairs still won't overflow 32 bits, you can extract the high half, `_mm_add_epi32`, and `_mm256_cvtepu32_epi64` to zero-extend. Then add to a 64-bit accumulator with `_mm256_add_epi64`. This saves instructions and is still a totally negligible cost since it runs so infrequently.
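A sketch of that cheaper widen, assuming the inner-loop trip count was capped so the 32-bit pair sums can't overflow (helper name is made up):

```c
#include <immintrin.h>

// Fold 8x 32-bit counts into the 4x 64-bit totals with one 128-bit add.
static inline __m256i fold_into_64(__m256i counts64, __m256i counts32)
{
    __m128i lo    = _mm256_castsi256_si128(counts32);
    __m128i hi    = _mm256_extracti128_si256(counts32, 1);  // extract the high half
    __m128i sum32 = _mm_add_epi32(lo, hi);                  // 4x 32-bit sums, can't overflow
    __m256i sum64 = _mm256_cvtepu32_epi64(sum32);           // vpmovzxdq: zero-extend to 64-bit
    return _mm256_add_epi64(counts64, sum64);
}
```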
If position within the vector does matter, then similar to what @aqrit shows but with zero-extension instead of sign-extension: extract the high half and use `vpmovzxdq` (`_mm256_cvtepu32_epi64`) twice, or use `vpermd` + `vpsrlq` (shift by 32) on the high half.
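A sketch of the position-preserving widen (illustrative names; both alternatives keep the 64-bit counts in element order):

```c
#include <immintrin.h>

// Option A: vpmovzxdq twice, extracting the high 128 bits for the second one.
static inline void widen_in_order(__m256i counts32, __m256i *lo64, __m256i *hi64)
{
    *lo64 = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(counts32));
    *hi64 = _mm256_cvtepu32_epi64(_mm256_extracti128_si256(counts32, 1));
}

// Option B for the high half: vpermd puts elements 4..7 into the upper 32 bits of
// each 64-bit lane, then a 64-bit logical right shift by 32 zero-extends them.
static inline __m256i widen_high_permd(__m256i counts32)
{
    __m256i idx  = _mm256_setr_epi32(4, 4, 5, 5, 6, 6, 7, 7);   // even positions are don't-care
    __m256i perm = _mm256_permutevar8x32_epi32(counts32, idx);  // vpermd
    return _mm256_srli_epi64(perm, 32);                         // vpsrlq: discards the don't-cares
}
```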
If you aren't keeping the same vector(s) of 64-bit counters across multiple vectors of FP compares, then yeah you probably need to widen each mask vector separately.
If you're using a strategy where zero-extending is cheaper than sign-extending, you might start with `_mm256_abs_epi32` (`vpabsd`) to turn -1 (all-ones) into +1. But if you're extracting the high half to set up for 2x `vpmovsxdq` then just use `vpsubq` to accumulate.
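A sketch of both per-compare widening strategies (helper names are made up; cmp is the 0 / -1 compare result reinterpreted as integers):

```c
#include <immintrin.h>

// Sign-extension strategy: vpmovsxdq on each half, then vpsubq so that
// subtracting -1 increments the 64-bit counters.
static inline void accumulate_signext(__m256i cmp, __m256i *lo64, __m256i *hi64)
{
    *lo64 = _mm256_sub_epi64(*lo64, _mm256_cvtepi32_epi64(_mm256_castsi256_si128(cmp)));
    *hi64 = _mm256_sub_epi64(*hi64, _mm256_cvtepi32_epi64(_mm256_extracti128_si256(cmp, 1)));
}

// Zero-extension strategy: vpabsd turns -1 into +1, then in-lane unpack against zero
// widens without extracting the high half. Note the counters end up in in-lane
// unpack order (0,1,4,5 and 2,3,6,7), not linear order.
static inline void accumulate_zeroext(__m256i cmp, __m256i *a64, __m256i *b64)
{
    __m256i ones = _mm256_abs_epi32(cmp);   // -1 -> +1, 0 stays 0
    __m256i zero = _mm256_setzero_si256();
    *a64 = _mm256_add_epi64(*a64, _mm256_unpacklo_epi32(ones, zero));
    *b64 = _mm256_add_epi64(*b64, _mm256_unpackhi_epi32(ones, zero));
}
```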