How efficient your inverse movemask is depends strongly on the compiler. With gcc it takes about 21 instructions. With clang and -std=c99 -O3 -m64 -Wall -march=nehalem, however, the code vectorizes well and the result is actually not too bad:
movemask_inverse_original: # @movemask_inverse_original
movd xmm0, edi
pshufd xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0]
pand xmm0, xmmword ptr [rip + .LCPI0_0]
cvtdq2ps xmm1, xmm0
xorps xmm0, xmm0
cmpneqps xmm0, xmm1
ret
Nevertheless, the cvtdq2ps integer-to-float conversion is not needed: it is more efficient to compute the mask in the integer domain and cast (without conversion) the result to float afterwards.
Peter Cordes' answer to the question "Is there an inverse instruction to the movemask instruction in intel avx2?" discusses many ideas for the AVX2 case. Most of these ideas can be used in some form for the SSE case too. The LUT solution and the ALU solution are both suitable for your case.
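LUT solution, as a minimal sketch (the table layout and the name movemask_inverse_lut are mine, not taken from that answer): the 16 possible masks are precomputed and selected with a single aligned load.

#include <stdint.h>
#include <immintrin.h>

/* Entry i has 32-bit lane j set to all-ones when bit j of i is set. */
static const int32_t movemask_lut[16][4] __attribute__((aligned(16))) = {
    { 0,  0,  0,  0}, {-1,  0,  0,  0}, { 0, -1,  0,  0}, {-1, -1,  0,  0},
    { 0,  0, -1,  0}, {-1,  0, -1,  0}, { 0, -1, -1,  0}, {-1, -1, -1,  0},
    { 0,  0,  0, -1}, {-1,  0,  0, -1}, { 0, -1,  0, -1}, {-1, -1,  0, -1},
    { 0,  0, -1, -1}, {-1,  0, -1, -1}, { 0, -1, -1, -1}, {-1, -1, -1, -1}
};

__m128 movemask_inverse_lut(int x) {
    /* One aligned 16-byte load picks the precomputed mask for the 4-bit input. */
    return _mm_castsi128_ps(_mm_load_si128((const __m128i *)movemask_lut[x & 0xF]));
}

The LUT trades a 256-byte table (and its cache footprint) for a single load, while the ALU version below needs no table, only the two 16-byte constants visible in its generated assembly.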
ALU solution with intrinsics:
#include <immintrin.h>

__m128 movemask_inverse_alternative(int x) {
    __m128i msk8421 = _mm_set_epi32(8, 4, 2, 1);      /* lane i holds bit value 1 << i          */
    __m128i x_bc    = _mm_set1_epi32(x);              /* broadcast the mask to all 4 lanes      */
    __m128i t       = _mm_and_si128(x_bc, msk8421);   /* isolate bit i in lane i                */
    return _mm_castsi128_ps(_mm_cmpeq_epi32(msk8421, t));  /* lane = all-ones iff its bit was set */
}
Generated assembly with gcc 8.3 (gcc -std=c99 -O3 -m64 -Wall -march=nehalem):
movemask_inverse_alternative:
movd xmm1, edi
pshufd xmm0, xmm1, 0
pand xmm0, XMMWORD PTR .LC0[rip]
pcmpeqd xmm0, XMMWORD PTR .LC1[rip]
ret
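A quick sanity check, assuming this is compiled together with movemask_inverse_alternative from above: _mm_movemask_ps should round-trip all 16 possible 4-bit masks.

#include <stdio.h>
#include <immintrin.h>

int main(void) {
    for (int x = 0; x < 16; x++) {
        __m128 v = movemask_inverse_alternative(x);   /* mask -> vector */
        if (_mm_movemask_ps(v) != x) {                /* vector -> mask */
            printf("mismatch at %d\n", x);
            return 1;
        }
    }
    printf("all 16 masks round-trip\n");
    return 0;
}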