
In _How to perform the inverse of `_mm256_movemask_epi8` (VPMOVMSKB)?_, the OP asks for the inverse of `_mm256_movemask_epi8`, but with SSE's `_mm_movemask_ps()`, is there a simpler version? This is the best I could come up with, which isn't too bad:

#include <immintrin.h>

__m128 movemask_inverse(int x) {
    // Put one selector bit in each lane, then compare against zero:
    // any nonzero lane becomes all-ones (0xFFFFFFFF), zero lanes stay 0.
    __m128 m = _mm_setr_ps(x & 1, x & 2, x & 4, x & 8);
    return _mm_cmpneq_ps(m, _mm_setzero_ps());
}
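
As a quick sanity check, round-tripping through `_mm_movemask_ps` should give back `x` for all 16 inputs; something along these lines:

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    for (int x = 0; x < 16; x++) {
        int back = _mm_movemask_ps(movemask_inverse(x));
        printf("%2d -> %2d %s\n", x, back, x == back ? "ok" : "FAIL");
    }
    return 0;
}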
Vortico
  • Should the masks be this way around? Isn't it backwards? – harold Jun 16 '19 at 20:26
  • @harold You are correct. Fixed by changing to `setr` – Vortico Jun 16 '19 at 20:27
  • Peter Cordes' answer on [_is there an inverse instruction to the movemask instruction in intel avx2?_](https://stackoverflow.com/a/36491672) discusses many ideas for the AVX2 case. Most of those ideas can be used in some form for the SSE case too. The LUT solution and the ALU solution are suitable for your case. – wim Jun 16 '19 at 21:12
  • See for example [this Godbolt link](https://godbolt.org/z/b4Q3F4). – wim Jun 16 '19 at 21:35
  • @wim Thanks, that's clever, and it performs a bit better. Post the code as an answer and I'll accept. – Vortico Jun 16 '19 at 21:55

1 Answer


The efficiency of your inverse movemask strongly depends on the compiler. With gcc it compiles to about 21 instructions.

But with clang (`-std=c99 -O3 -m64 -Wall -march=nehalem`) the code vectorizes well, and the result actually isn't too bad:

movemask_inverse_original:              # @movemask_inverse_original
        movd    xmm0, edi
        pshufd  xmm0, xmm0, 0           # xmm0 = xmm0[0,0,0,0]
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        cvtdq2ps        xmm1, xmm0
        xorps   xmm0, xmm0
        cmpneqps        xmm0, xmm1
        ret

Nevertheless, you don't need the `cvtdq2ps` integer-to-float conversion. It is more efficient to compute the mask in the integer domain and to cast (without conversion) the result to float afterwards. Peter Cordes' answer on [is there an inverse instruction to the movemask instruction in intel avx2?](https://stackoverflow.com/a/36491672) discusses many ideas for the AVX2 case. Most of those ideas can be used in some form for the SSE case too. The LUT solution and the ALU solution are suitable for your case.

ALU solution with intrinsics:

#include <immintrin.h>

__m128 movemask_inverse_alternative(int x) {
    __m128i msk8421 = _mm_set_epi32(8, 4, 2, 1);           // one selector bit per lane
    __m128i x_bc = _mm_set1_epi32(x);                      // broadcast x to all lanes
    __m128i t = _mm_and_si128(x_bc, msk8421);              // isolate each lane's bit of x
    return _mm_castsi128_ps(_mm_cmpeq_epi32(msk8421, t));  // all-ones where the bit is set
}

Generated assembly with gcc 8.3 (`gcc -std=c99 -O3 -m64 -Wall -march=nehalem`):

movemask_inverse_alternative:
  movd xmm1, edi
  pshufd xmm0, xmm1, 0
  pand xmm0, XMMWORD PTR .LC0[rip]
  pcmpeqd xmm0, XMMWORD PTR .LC1[rip]
  ret
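
For completeness, the LUT approach could look roughly like this: index the 4-bit mask directly into a 16-entry table of precomputed lane masks (a sketch; the table layout and names are just one possible choice):

#include <stdint.h>
#include <immintrin.h>

/* Row i has lane j set to all-ones iff bit j of i is set. */
#define ROW(i) { (i) & 1 ? -1 : 0, (i) & 2 ? -1 : 0, (i) & 4 ? -1 : 0, (i) & 8 ? -1 : 0 }
static const int32_t lut[16][4] __attribute__((aligned(16))) = {
    ROW(0),  ROW(1),  ROW(2),  ROW(3),  ROW(4),  ROW(5),  ROW(6),  ROW(7),
    ROW(8),  ROW(9),  ROW(10), ROW(11), ROW(12), ROW(13), ROW(14), ROW(15)
};

__m128 movemask_inverse_lut(int x) {
    /* One aligned 16-byte load: trades a 256-byte table for the ALU work. */
    return _mm_load_ps((const float *)lut[x & 15]);
}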
wim
  • Shouldn't the parameters to `_mm_cmpeq_epi32()` be `_mm_cmpeq_epi32(msk8421, t)`? Furthermore, `cmpeq` sets the integers to either `0xFF` or `0x00`, so wouldn't you need another `_mm_and_si128(, _mm_set1_epi32(1))`? – Till Kolditz Feb 25 '20 at 10:32
  • @TillKolditz Thanks for pointing out the error in my answer. I will fix it. Usually a result of `0` or `0xFFFFFFFF` is suitable for further computations, but in some cases you might prefer, for example, `_mm_and_ps(movemask_inverse_alternative(x), _mm_set1_ps(1.0f))`. Note that casting `_mm_set1_epi32(1)` to floating point does not lead to a vector of `1.0f`. – wim Jul 11 '20 at 00:02
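
To illustrate that last comment, a 0.0f/1.0f variant might look like this (a sketch building on `movemask_inverse_alternative` above; the function name here is just for illustration):

#include <immintrin.h>

/* Convert the 0 / 0xFFFFFFFF mask to 0.0f / 1.0f per lane.
   Note: casting _mm_set1_epi32(1) to float would give a denormal, not 1.0f. */
__m128 movemask_inverse_01(int x) {
    __m128 mask = movemask_inverse_alternative(x);  /* 0 or all-ones per lane */
    return _mm_and_ps(mask, _mm_set1_ps(1.0f));     /* keeps the bit pattern of 1.0f */
}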