Yes, AVX2 has efficient byte-broadcast. SSSE3 pshufb
with an all-zero mask is just as cheap, but you have to create the shuffle-control vector. AVX512BW/F even has single-instruction vpbroadcastb/w/d/q x/y/zmm, r32
. (With optional masking, so you can zero some or merge with an existing vector if you want, e.g. insert at a position using a single-bit mask.)
Fortunately, compilers know how to do this when implementing _mm_set1_epi8
so we can leave it to the compiler.
Then it just boils down to the usual pcmpeqb
/pmovmskb
to get an integer which will have a 1
bit for matching elements, which you can branch on.
// 0 for not found, non-zero for found. (Bit position tells you where).
unsigned contains(__m128i data, uint8_t needle) {
__m128i k = _mm_set1_epi8(needle);
__m128i cmp = _mm_cmpeq_epi8(data, k); // vector mask
return _mm_movemask_epi8(cmp); // integer bitmask
}
As you'd expect, all compilers make this use this asm (Godbolt)
contains(long long __vector(2), unsigned char):
vmovd xmm1, edi
vpbroadcastb xmm1, xmm1
vpcmpeqb xmm0, xmm0, xmm1
vpmovmskb eax, xmm0
ret
Except MSVC, which wastes an instruction on movsx eax, dl
first. (Windows x64 passes the 2nd arg in RDX, vs. x86-64 System V passing the first integer arg in RDI.)
Without AVX2, you'll get something like this with SSSE3 or above
# gcc8.3 -O3 -march=nehalem
contains(long long __vector(2), unsigned char):
movd xmm1, edi
pxor xmm2, xmm2
pshufb xmm1, xmm2 # _mm_shuffle_epi8(needle, _mm_setzero_si128())
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
ret
Or with just SSE2 (baseline for x86-64):
contains(long long __vector(2), unsigned char):
mov DWORD PTR [rsp-12], edi
movd xmm1, DWORD PTR [rsp-12] # gcc's tune=generic strategy is still store/reload /facepalm
punpcklbw xmm1, xmm1 # duplicate to low 2 bytes
punpcklwd xmm1, xmm1 # duplciate to low 4 bytes
pshufd xmm1, xmm1, 0 # broadcast
pcmpeqb xmm1, xmm0
pmovmskb eax, xmm1
ret
Related: