How efficient your inverse movemask is depends strongly on the compiler. With gcc it takes about 21 instructions. With clang and -std=c99 -O3 -m64 -Wall -march=nehalem, however, the code vectorizes well and the result is actually not too bad:
movemask_inverse_original: # @movemask_inverse_original
movd xmm0, edi
pshufd xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0]
pand xmm0, xmmword ptr [rip + .LCPI0_0]
cvtdq2ps xmm1, xmm0
xorps xmm0, xmm0
cmpneqps xmm0, xmm1
ret
Nevertheless, the cvtdq2ps integer-to-float conversion is not needed: it is more efficient to compute the mask in the integer domain and cast (without conversion) the result to float afterwards.
Peter Cordes' answer to the question "Is there an inverse instruction to the movemask instruction in intel avx2?" discusses many ideas for the AVX2 case. Most of these ideas can be used in some form for the SSE case too. The LUT solution and the ALU solution are both suitable for your case.
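LUT solution, as a minimal sketch (the table layout and the name movemask_inverse_lut are mine, not taken from that answer): the 16 possible masks are precomputed and selected with a single aligned load.

#include <stdint.h>
#include <immintrin.h>

/* Entry i has 32-bit lane j set to all-ones when bit j of i is set. */
static const int32_t movemask_lut[16][4] __attribute__((aligned(16))) = {
    { 0,  0,  0,  0}, {-1,  0,  0,  0}, { 0, -1,  0,  0}, {-1, -1,  0,  0},
    { 0,  0, -1,  0}, {-1,  0, -1,  0}, { 0, -1, -1,  0}, {-1, -1, -1,  0},
    { 0,  0,  0, -1}, {-1,  0,  0, -1}, { 0, -1,  0, -1}, {-1, -1,  0, -1},
    { 0,  0, -1, -1}, {-1,  0, -1, -1}, { 0, -1, -1, -1}, {-1, -1, -1, -1}
};

__m128 movemask_inverse_lut(int x) {
    /* One aligned 16-byte load picks the precomputed mask for the 4-bit input. */
    return _mm_castsi128_ps(_mm_load_si128((const __m128i *)movemask_lut[x & 0xF]));
}

The LUT trades a 256-byte table (and its cache footprint) for a single load, while the ALU version below needs no table, only the two 16-byte constants visible in its generated assembly.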
ALU solution with intrinsics:
#include <immintrin.h>

__m128 movemask_inverse_alternative(int x) {
    __m128i msk8421 = _mm_set_epi32(8, 4, 2, 1);      /* lane i holds bit value 1 << i          */
    __m128i x_bc    = _mm_set1_epi32(x);              /* broadcast the mask to all 4 lanes      */
    __m128i t       = _mm_and_si128(x_bc, msk8421);   /* isolate bit i in lane i                */
    return _mm_castsi128_ps(_mm_cmpeq_epi32(msk8421, t));  /* lane = all-ones iff its bit was set */
}
Generated assembly with gcc 8.3 (gcc -std=c99 -O3 -m64 -Wall -march=nehalem):
movemask_inverse_alternative:
movd xmm1, edi
pshufd xmm0, xmm1, 0
pand xmm0, XMMWORD PTR .LC0[rip]
pcmpeqd xmm0, XMMWORD PTR .LC1[rip]
ret
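A quick sanity check, assuming this is compiled together with movemask_inverse_alternative from above: _mm_movemask_ps should round-trip all 16 possible 4-bit masks.

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    for (int x = 0; x < 16; x++) {
        __m128 v = movemask_inverse_alternative(x);   /* mask -> vector */
        if (_mm_movemask_ps(v) != x) {                /* vector -> mask */
            printf("mismatch at %d\n", x);
            return 1;
        }
    }
    printf("all 16 masks round-trip\n");
    return 0;
}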