
I'm looking for the fastest way to divide an __m256i of packed 32-bit integers by two (aka shift right by one) using AVX. I don't have access to AVX2. As far as I know, my options are:

  1. Drop down to SSE2
  2. Something like the approach from "AVX __m256i integer division for signed 32-bit elements"

If I need to go down to SSE2, I'd appreciate the best SSE2 implementation. If option 2 is the way to go, I'd like to know which intrinsics to use, and whether there's a more optimized implementation specifically for dividing by 2. Thanks!

GlassBeaver
  • signed or unsigned? With unsigned it's extremely easy, just a `PSRLD` is enough. You don't use `div` to divide by a power of 2; just do a right shift. If it's signed then it's much more difficult because [you'll need an `add` to correct the result](https://godbolt.org/z/s8cKsYo6f) (a sketch of that correction appears after the comments) and it's very tricky to [do an integer add in AVX](https://stackoverflow.com/q/24399754/995714) – phuclv May 01 '22 at 02:40
  • @phuclv Isn't PSRLD AVX2? I only have access to AVX. – GlassBeaver May 01 '22 at 05:24
  • sorry, you're correct. It's available in AVX but [only for xmm registers](https://www.felixcloutier.com/x86/psrlw:psrld:psrlq) – phuclv May 01 '22 at 09:22
  • If you want integer math with AVX1, use 128-bit vectors. That's not strictly SSE2, though; the compiler can still use `vpsrld xmm1, xmm0, 1` to copy-and-shift, not destroying the input operand, saving itself a `movdqa` instruction if your code uses the original `__m128i` as well as the shift result later. – Peter Cordes May 01 '22 at 10:51
  • If your integers are in `[0; 2^25-1]` you can get away with reinterpreting them as `float` and multiplying by `0.5f` (maybe mask away the lowest bit first to avoid the round-to-even behavior). – chtz May 01 '22 at 14:49
  • @chtz: Oh interesting, yeah integers in that range are the bit-patterns for subnormal floats. (Or at least, divided by 2 will be subnormal). On some CPUs (e.g. Skylake), the performance will be disastrous (normal x subnormal => subnormal, or normal x normal => subnormal). And it breaks in executables built/linked with `-ffast-math`, where CRT startup code sets FTZ and DAZ (flush to zero and denormals are zero). – Peter Cordes May 03 '22 at 04:07
  • Related: [Under what conditions does a C++ compiler use floating-point pipelines to do integer division with run-time-known values for higher performance?](https://stackoverflow.com/a/72088738) proposed the same reinterpret trick, using `vdivps` where both operands are integers that represent subnormal floats. – Peter Cordes May 03 '22 at 04:09
  • Also didn't expect scalar memcpy to be vectorized. GCC is intelligent – huseyin tugrul buyukisik May 03 '22 at 05:31
  • @PeterCordes I agree that probably has very limited use-cases (especially since you can't do any further `int`-operations anyway on that vector). Of course `vcvtdq2ps`, `vmulps`, `vcvttps2dq` would be another option (by first masking away the lower bit, this would even work for slightly larger inputs); a sketch of that variant also follows the comments. – chtz May 03 '22 at 09:47
  • @huseyintugrulbuyukisik: It's not exactly "vectorizing memcpy", it's vectorizing scalar FP access to some objects, after `__builtin_memcpy` gets out of the way. – Peter Cordes May 03 '22 at 10:03
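
For reference, here is a minimal SSE2 sketch of the `add`-to-correct-the-result trick phuclv links above. It assumes you want C-style truncating division; the helper name is illustrative, not from the thread. (A bare arithmetic shift alone rounds odd negative values toward negative infinity instead.)

#include <emmintrin.h> // SSE2

// Hypothetical helper: divide signed 32-bit lanes by 2 with C semantics
// (round toward zero). Adding the sign bit first fixes odd negative lanes:
// -3 >> 1 == -2, but ( -3 + 1 ) >> 1 == -1 == -3 / 2.
static inline __m128i div2_epi32_truncating( __m128i x )
{
    __m128i sign = _mm_srli_epi32( x, 31 ); // 1 for negative lanes, else 0
    x = _mm_add_epi32( x, sign );           // bias negative lanes up by 1
    return _mm_srai_epi32( x, 1 );          // arithmetic shift right
}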
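
And a sketch of the `vcvtdq2ps` / `vmulps` / `vcvttps2dq` variant from chtz's last comment, staying in 256-bit registers with plain AVX. This is only valid while the masked inputs convert to `float` exactly (roughly `|x| <= 2^25`), and it rounds like an arithmetic shift rather than like C division; the function name is illustrative.

#include <immintrin.h> // AVX

// Sketch: halve signed 32-bit lanes without leaving 256-bit registers.
// Assumes |x| is small enough that ( x & ~1 ) is exactly representable
// as float (roughly |x| <= 2^25); rounds toward negative infinity.
static inline __m256i div2_epi32_via_float( __m256i vec )
{
    // Clear the low bit of every lane so multiplying by 0.5f is exact
    // (avoids round-to-even). AVX1 has no 256-bit integer AND, but the
    // float-domain vandps operates on the same bits.
    const __m256 mask = _mm256_castsi256_ps( _mm256_set1_epi32( ~1 ) );
    __m256 bits = _mm256_and_ps( _mm256_castsi256_ps( vec ), mask );
    __m256 f = _mm256_cvtepi32_ps( _mm256_castps_si256( bits ) ); // vcvtdq2ps
    f = _mm256_mul_ps( f, _mm256_set1_ps( 0.5f ) );               // vmulps
    return _mm256_cvttps_epi32( f );                              // vcvttps2dq
}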

1 Answer


Assuming you know what you’re doing, here’s that function.

#include <immintrin.h>

inline __m256i div2_epi32( __m256i vec )
{
    // Split the 32-byte vector into 16-byte ones
    __m128i low = _mm256_castsi256_si128( vec );
    __m128i high = _mm256_extractf128_si256( vec, 1 );
    // Shift the lanes within each piece; replace with _mm_srli_epi32 for unsigned version
    low = _mm_srai_epi32( low, 1 );
    high = _mm_srai_epi32( high, 1 );
    // Combine back into 32-byte vector
    vec = _mm256_castsi128_si256( low );
    return _mm256_insertf128_si256( vec, high, 1 );
}

However, doing that is not necessarily faster than simply working with 16-byte vectors throughout. On most CPUs the performance of these insert/extract instructions isn't great; the exception is maybe AMD Zen 1, which splits 256-bit operations into 128-bit halves anyway, so the lane-crossing instructions cost it little.
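
For comparison, here is what the 16-byte route from Peter Cordes's comment might look like over an array. The function and its assumption that the length is a multiple of 4 are illustrative only.

#include <emmintrin.h> // SSE2 intrinsics; with -mavx they compile to VEX forms
#include <stddef.h>
#include <stdint.h>

// Sketch: arithmetic-shift an array of signed 32-bit ints right by 1,
// four lanes per iteration. Assumes n is a multiple of 4 for brevity.
static inline void div2_epi32_array( int32_t* data, size_t n )
{
    for( size_t i = 0; i < n; i += 4 )
    {
        __m128i v = _mm_loadu_si128( (const __m128i*)( data + i ) );
        v = _mm_srai_epi32( v, 1 );
        _mm_storeu_si128( (__m128i*)( data + i ), v );
    }
}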

Soonts