3

I'm looking for efficient ways to compute the following function:

Input: __m128i data, uint8_t in;

Output: boolean indicating if any byte in data is in.

I'm essentially using them to implement a space/time efficient stack for bytes with capacity 8. My most efficient solution is to first compute a __m128i tmp with all bytes as in. Then check if any bytes in tmp\xor data is a zero byte.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
redplum
  • 147
  • 5

1 Answers1

7

Yes, AVX2 has efficient byte-broadcast. SSSE3 pshufb with an all-zero mask is just as cheap, but you have to create the shuffle-control vector. AVX512BW/F even has single-instruction vpbroadcastb/w/d/q x/y/zmm, r32. (With optional masking, so you can zero some or merge with an existing vector if you want, e.g. insert at a position using a single-bit mask.)

Fortunately, compilers know how to do this when implementing _mm_set1_epi8 so we can leave it to the compiler.

Then it just boils down to the usual pcmpeqb/pmovmskb to get an integer which will have a 1 bit for matching elements, which you can branch on.

// 0 for not found, non-zero for found.  (Bit position tells you where).
unsigned contains(__m128i data, uint8_t needle) {
    __m128i k = _mm_set1_epi8(needle);
    __m128i cmp = _mm_cmpeq_epi8(data, k);  // vector mask
    return _mm_movemask_epi8(cmp);          // integer bitmask 
}

As you'd expect, all compilers make this use this asm (Godbolt)

contains(long long __vector(2), unsigned char):
    vmovd   xmm1, edi
    vpbroadcastb    xmm1, xmm1
    vpcmpeqb        xmm0, xmm0, xmm1
    vpmovmskb       eax, xmm0
    ret

Except MSVC, which wastes an instruction on movsx eax, dl first. (Windows x64 passes the 2nd arg in RDX, vs. x86-64 System V passing the first integer arg in RDI.)


Without AVX2, you'll get something like this with SSSE3 or above

# gcc8.3 -O3 -march=nehalem
contains(long long __vector(2), unsigned char):
    movd    xmm1, edi

    pxor    xmm2, xmm2
    pshufb  xmm1, xmm2         # _mm_shuffle_epi8(needle, _mm_setzero_si128())

    pcmpeqb xmm0, xmm1
    pmovmskb        eax, xmm0
    ret

Or with just SSE2 (baseline for x86-64):

contains(long long __vector(2), unsigned char):
    mov     DWORD PTR [rsp-12], edi
    movd    xmm1, DWORD PTR [rsp-12]    # gcc's tune=generic strategy is still store/reload  /facepalm
    punpcklbw       xmm1, xmm1          # duplicate to low 2 bytes
    punpcklwd       xmm1, xmm1          # duplciate to low 4 bytes
    pshufd  xmm1, xmm1, 0               # broadcast

    pcmpeqb xmm1, xmm0
    pmovmskb        eax, xmm1
    ret

Related:

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847