I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb
lookup on the lowest bytes. This isn't terribly difficult with AVX-512 (replace the &
with a masked move if using 512-bit vectors):
__m256i cmp_8_into_32(__m256i a, __m256i b) {
return _mm256_popcnt_epi32(_mm256_cmpeq_epi8(a, b)
& _mm256_set1_epi32(0xff0f0301 /* can be any order */));
}
That's three uops and, assuming perfect scheduling, a throughput of 1 according to uops.info—not bad. Alas, vpopcntd
isn't in AVX2. What's the optimal way to do this operation there? The best I can think of is to mask the pairs of bits at indices 7,8 and 15,16, then perform two constant-amount vpsrld
and a vpor
. So that's 6 uops, throughput of 2.5 ish. Not bad, but I wonder if there's something better.