I think I'd probably go for the "brute force and ignorance" approach initially, maybe something like this:
uint8_t u = 0x53; // 01010011
const union {
uint32_t a[4];
__m128i v;
} kLUT[16] = { { { 0, 0, 0, 0 } },
{ { -1, 0, 0, 0 } },
{ { 0, -1, 0, 0 } },
{ { -1, -1, 0, 0 } },
{ { 0, 0, -1, 0 } },
{ { -1, 0, -1, 0 } },
{ { 0, -1, -1, 0 } },
{ { -1, -1, -1, 0 } },
{ { 0, 0, 0, -1 } },
{ { -1, 0, 0, -1 } },
{ { 0, -1, 0, -1 } },
{ { -1, -1, 0, -1 } },
{ { 0, 0, -1, -1 } },
{ { -1, 0, -1, -1 } },
{ { 0, -1, -1, -1 } },
{ { -1, -1, -1, -1 } } };
__m256i v = _mm256_set_m128i(kLUT[u >> 4].v, kLUT[u & 15].v);
Using clang -O3
this compiles to:
movl %ebx, %eax ;; eax = ebx = u
andl $15, %eax ;; get low offset = (u & 15) * 16
shlq $4, %rax
leaq _main.kLUT(%rip), %rcx ;; rcx = kLUT
vmovaps (%rax,%rcx), %xmm0 ;; load low half of ymm0 from kLUT
andl $240, %ebx ;; get high offset = (u >> 4) * 16
vinsertf128 $1, (%rbx,%rcx), %ymm0, %ymm0
;; load high half of ymm0 from kLUT
FWIW I threw together a simple test harness for three implementations: (i) a simple scalar code reference implementation, (ii) the above code, (iii) an implementation based on @Zboson's answer, (iv) a slightly improved version of (iii) and (v) a further improvement on (iv) using a suggestion from @MarcGlisse. I got the following results with a 2.6GHz Haswell CPU (compiled with clang -O3
):
scalar code: 7.55336 ns / vector
Paul R: 1.36016 ns / vector
Z boson: 1.24863 ns / vector
Z boson (improved): 1.07590 ns / vector
Z boson (improved + @MarcGlisse suggestion): 1.08195 ns / vector
So @Zboson's solution(s) win, by around 10% - 20%, presumably because they need only 1 load, versus 2 for mine.
If we get any other implementations I'll add these to the test harness and update the results.
Slightly improved version of @Zboson's implementation:
__m256i v = _mm256_set1_epi8(u);
v = _mm256_and_si256(v, mask);
v = _mm256_xor_si256(v, mask);
return _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
Further improved version of @Zboson's implementation incorporating suggestion from @MarcGlisse:
__m256i v = _mm256_set1_epi8(u);
v = _mm256_and_si256(v, mask);
return _mm256_cmpeq_epi32(v, mask);
(Note that mask
needs to contain replicated 8 bit values in each 32 bit element, i.e. 0x01010101, 0x02020202, ..., 0x80808080
)