
I am trying to increment a set of 8 x 64-bit uints depending on the results of 8 x 32-bit float comparisons.

I am storing the mask result of the comparison in a __m256 register, and the values to be incremented are stored in a __m256i[2].

Is there an efficient way of padding the 256-bit mask out to 512 bits? I am currently using:

    const __m256 paddedMask[2] = {
        _mm256_set_ps(mask.m256_f32[0], mask.m256_f32[0], mask.m256_f32[1], mask.m256_f32[1],
            mask.m256_f32[2], mask.m256_f32[2], mask.m256_f32[3], mask.m256_f32[3]),
        _mm256_set_ps(mask.m256_f32[4], mask.m256_f32[4], mask.m256_f32[5], mask.m256_f32[5],
            mask.m256_f32[6], mask.m256_f32[6], mask.m256_f32[7], mask.m256_f32[7])
    };

A complete compilable example can be found here: https://godbolt.org/z/3Y3PTnoj8

allanmb

2 Answers


You could use sign-extension (vpmovsxdq).

    #include <immintrin.h>

    extern __m256 cmp_ps;
    extern __m256i values_lo;
    extern __m256i values_hi;

    // reinterpret the float compare mask as integers
    __m256i cmp_32 = _mm256_castps_si256(cmp_ps);

    // extract each 128-bit half
    __m128i lo_cmp_32 = _mm256_castsi256_si128(cmp_32);
    __m128i hi_cmp_32 = _mm256_extracti128_si256(cmp_32, 1);

    // sign-extend i32x4 to i64x4 (vpmovsxdq)
    __m256i lo_cmp_64 = _mm256_cvtepi32_epi64(lo_cmp_32);
    __m256i hi_cmp_64 = _mm256_cvtepi32_epi64(hi_cmp_32);

    // the compare mask is either -1 or 0;
    // subtracting -1 is the same as adding 1
    values_lo = _mm256_sub_epi64(values_lo, lo_cmp_64);
    values_hi = _mm256_sub_epi64(values_hi, hi_cmp_64);

If you want to increment by something other than 1, then just use a _mm256_and_si256 with a vector of step values followed by an add, not a blend.
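For instance, a minimal sketch of that masked-add variant, reusing the mask vectors from the code above; the step value of 4 and the name step64 are just illustrative:

    // Sketch: increment by an arbitrary step instead of 1.
    // `step64` is a hypothetical vector of 64-bit step values.
    __m256i step64 = _mm256_set1_epi64x(4);                  // e.g. count in steps of 4
    __m256i add_lo = _mm256_and_si256(lo_cmp_64, step64);    // step where the mask is -1, 0 elsewhere
    __m256i add_hi = _mm256_and_si256(hi_cmp_64, step64);
    values_lo = _mm256_add_epi64(values_lo, add_lo);
    values_hi = _mm256_add_epi64(values_hi, add_hi);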

However, moving values from the low 128 bits to the high 128 bits is fairly expensive. You should consider re-ordering the accumulators so that only in-lane unpack instructions (vpunpckhdq/vpunpckldq) are needed.
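One way that re-ordering might look (the accumulator names acc_0145 and acc_2367 are made up for illustration): because the compare mask is all-ones or all-zeros, unpacking it with itself already produces a valid 64-bit mask, just with the elements interleaved, so one accumulator holds counters for elements {0,1,4,5} and the other for {2,3,6,7}:

    // Sketch: widen the mask with in-lane unpacks instead of lane-crossing shuffles.
    __m256i m32    = _mm256_castps_si256(cmp_ps);
    __m256i m64_lo = _mm256_unpacklo_epi32(m32, m32);   // 64-bit masks for elements 0,1,4,5
    __m256i m64_hi = _mm256_unpackhi_epi32(m32, m32);   // 64-bit masks for elements 2,3,6,7
    acc_0145 = _mm256_sub_epi64(acc_0145, m64_lo);      // hypothetical re-ordered accumulators
    acc_2367 = _mm256_sub_epi64(acc_2367, m64_hi);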

Edit: Alternatively, using _mm256_permutevar8x32_epi32 for the high 128 bits is cheaper than an extract followed by a sign extend (vpmovsxdq).
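Presumably something like the following, which relies on the mask being -1 or 0 so that duplicating each 32-bit element directly yields the 64-bit mask (the index constant is my own choice):

    // Sketch: one vpermd replaces the extract + vpmovsxdq for the high half.
    const __m256i hi_idx = _mm256_set_epi32(7, 7, 6, 6, 5, 5, 4, 4);
    __m256i hi_cmp_64_alt = _mm256_permutevar8x32_epi32(cmp_32, hi_idx);
    values_hi = _mm256_sub_epi64(values_hi, hi_cmp_64_alt);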

aqrit
  • Thank you for the reply. Great idea using the subtract with the mask. That does save doing an add/blend. I have seen some posts about some permutevar instructions but was unsure if they were the correct thing to do. I'll do some investigation into that :-) – allanmb Jul 16 '23 at 16:18

If you're doing this in a loop with the same vector of counters for all FP compares, you can just use 32-bit counts (_mm256_sub_epi32(counts32, _mm256_castps_si256(compare))) for up to 2^32-1 iterations of an inner loop before needing to widen and add to the 64-bit counts. (With zero-extension, which is easier than sign-extension for the high half.)
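A rough sketch of that loop structure; the function name, data source, threshold, and the inner-loop cap are my own assumptions (the cap of 2^23 elements, i.e. 2^20 inner iterations, is just comfortably below the 2^32-1 limit):

    #include <immintrin.h>
    #include <algorithm>
    #include <cstddef>

    // Sketch: per-lane counts of elements below `thresh` in `data` (length `n`,
    // assumed a multiple of 8).  Keeps 32-bit counts in the inner loop and only
    // widens to 64-bit occasionally.
    void count_below(const float* data, std::size_t n, float thresh,
                     __m256i& counts64_lo, __m256i& counts64_hi)
    {
        const __m256 threshold = _mm256_set1_ps(thresh);
        counts64_lo = _mm256_setzero_si256();   // 64-bit counts for lanes 0..3
        counts64_hi = _mm256_setzero_si256();   // 64-bit counts for lanes 4..7
        std::size_t i = 0;
        while (i < n) {
            std::size_t inner_end = i + std::min<std::size_t>(n - i, std::size_t(1) << 23);
            __m256i counts32 = _mm256_setzero_si256();
            for (; i < inner_end; i += 8) {
                __m256 cmp = _mm256_cmp_ps(_mm256_loadu_ps(data + i),
                                           threshold, _CMP_LT_OQ);
                counts32 = _mm256_sub_epi32(counts32, _mm256_castps_si256(cmp));
            }
            // Widen with zero-extension (vpmovzxdq) and fold into the 64-bit totals.
            counts64_lo = _mm256_add_epi64(counts64_lo,
                _mm256_cvtepu32_epi64(_mm256_castsi256_si128(counts32)));
            counts64_hi = _mm256_add_epi64(counts64_hi,
                _mm256_cvtepu32_epi64(_mm256_extracti128_si256(counts32, 1)));
        }
    }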

The same idea of using an inner loop to defer widening is shown in full-code answers to How to count character occurrences using SIMD (where the inner loop can only run at most 255 times, and there's a special trick to horizontal-sum 8-bit to 64-bit, unlike in this case.)

You might unroll with two accumulator vectors, since some CPUs can sustain two FP compares and two vpsubd integer accumulates per clock, so you'd avoid a latency bottleneck. (Alder Lake can do 3 loads per clock, enough to feed more than 1 FP compare per clock even if both operands come from memory. https://uops.info/.) Or not, if you expect your data to usually not be hot in L1d cache so L2 or worse bandwidth will be a bottleneck, especially if you need two loads per FP compare. Or if you're comparing the results of a more expensive computation.
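As a sketch only, the inner loop of the example above unrolled by two so the two vpsubd chains and two FP compares can overlap (counts32_a/counts32_b are illustrative names; the surrounding names are the same assumptions as before):

    // Sketch: two independent accumulators inside the inner loop.
    __m256i counts32_a = _mm256_setzero_si256();
    __m256i counts32_b = _mm256_setzero_si256();
    for (; i + 16 <= inner_end; i += 16) {
        __m256 cmp_a = _mm256_cmp_ps(_mm256_loadu_ps(data + i),     threshold, _CMP_LT_OQ);
        __m256 cmp_b = _mm256_cmp_ps(_mm256_loadu_ps(data + i + 8), threshold, _CMP_LT_OQ);
        counts32_a = _mm256_sub_epi32(counts32_a, _mm256_castps_si256(cmp_a));
        counts32_b = _mm256_sub_epi32(counts32_b, _mm256_castps_si256(cmp_b));
    }
    __m256i counts32 = _mm256_add_epi32(counts32_a, counts32_b);  // combine before widening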

Amortizing the cost of widening makes it basically irrelevant except for overall code-size, and for very small arrays where it still has to happen once for only a couple of vectors of FP compares. So we should aim for a small uop-cache footprint and/or not needing vector constants.


If you ultimately want to sum down to a single scalar, position doesn't matter, so we can add odd/even pairs like _mm256_and_si256(v, _mm256_set1_epi64x(0x00000000FFFFFFFF)) and _mm256_srli_epi64(v, 32), or do the low halves with a shift left then right by 32 to zero-extend into the containing 64-bit element. Probably a better way to avoid loading a vector constant is _mm256_unpacklo_epi32 and _mm256_unpackhi_epi32(v, _mm256_setzero_si256()) (these operate in-lane, the same shuffle within each 128-bit lane).
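For example, a sketch of both widening variants, with counts32 standing in for the vector of 32-bit counts (the v above); only the eventual horizontal sum matters, so the pair ordering is irrelevant:

    // Variant 1: mask out the even 32-bit halves, shift down the odd ones, add.
    __m256i even  = _mm256_and_si256(counts32, _mm256_set1_epi64x(0x00000000FFFFFFFF));
    __m256i odd   = _mm256_srli_epi64(counts32, 32);
    __m256i pairs = _mm256_add_epi64(even, odd);             // 4 x 64-bit pair sums

    // Variant 2: no vector constant needed; interleave with zero instead.
    __m256i zero   = _mm256_setzero_si256();
    __m256i lo_zx  = _mm256_unpacklo_epi32(counts32, zero);  // elements 0,1 / 4,5 zero-extended
    __m256i hi_zx  = _mm256_unpackhi_epi32(counts32, zero);  // elements 2,3 / 6,7 zero-extended
    __m256i pairs2 = _mm256_add_epi64(lo_zx, hi_zx);         // 4 x 64-bit pair sums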

Or if you limit the inner loop to 2^31-1 or 2^30 iterations so adding pairs still won't overflow 32 bits, you can extract the high half, _mm_add_epi32, and _mm256_cvtepu32_epi64 to zero-extend. Then add to a 64-bit accumulator with _mm256_add_epi64. This saves instructions and is still totally negligible cost since it runs so infrequently.
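A sketch of that variant, again with counts32 as the 32-bit count vector and totals64 as a hypothetical 64-bit accumulator:

    // Sketch: halve to 128 bits first (pair sums can't overflow 32 bits if the
    // inner loop was capped), then zero-extend once and accumulate in 64-bit.
    __m128i lo128 = _mm256_castsi256_si128(counts32);
    __m128i hi128 = _mm256_extracti128_si256(counts32, 1);
    __m128i sum32 = _mm_add_epi32(lo128, hi128);         // 4 x 32-bit partial sums
    __m256i sum64 = _mm256_cvtepu32_epi64(sum32);        // zero-extend to 4 x 64-bit
    totals64 = _mm256_add_epi64(totals64, sum64);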

If position within the vector does matter, then similar to what @aqrit shows but with zero-extension instead of sign-extension: extract the high half and use vpmovzxdq twice (_mm256_cvtepu32_epi64), or use vpermd + vpsrlq (shift by 32) on the high half.
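A sketch of the vpermd + vpsrlq route for the high half (the index constant is my choice; counts32 is again the 32-bit count vector):

    // Sketch: position-preserving zero-extension without a second vpmovzxdq.
    const __m256i idx_hi = _mm256_set_epi32(7, 7, 6, 6, 5, 5, 4, 4);
    __m256i hi_dup  = _mm256_permutevar8x32_epi32(counts32, idx_hi);  // duplicate elements 4..7
    __m256i hi_zext = _mm256_srli_epi64(hi_dup, 32);                  // zero-extended elements 4..7
    __m256i lo_zext = _mm256_cvtepu32_epi64(_mm256_castsi256_si128(counts32)); // elements 0..3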


If you aren't keeping the same vector(s) of 64-bit counters across multiple vectors of FP compares, then yeah you probably need to widen each mask vector separately.

If you're using a strategy where zero-extending is cheaper than sign-extending, you might start with _mm256_abs_epi32 (vpabsd) to turn -1 (all-ones) into +1. But if you're extracting the high half to set up for 2x vpmovsxdq then just use vpsubq to accumulate.
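A minimal sketch of that vpabsd idea, assuming cmp is the raw compare result and counts32 the 32-bit accumulator:

    // Sketch: convert the -1/0 mask to +1/0 so it can be added directly.
    __m256i ones = _mm256_abs_epi32(_mm256_castps_si256(cmp));  // 1 where true, 0 where false
    counts32 = _mm256_add_epi32(counts32, ones);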

Peter Cordes
  • Thanks. I like the idea of the two accumulators so I will try and roll that out negating the need to convert between 32-bit and 64-bit :-) – allanmb Jul 19 '23 at 10:55