I'm assuming you're doing this in a loop over a large array. If you only used `__m128i` loads, you'd have 15 useful bytes, which would only produce 20 output bytes in your `__m256i` output. (Well, I guess the 21st byte of output would be present, since the 16th byte of the input vector holds the first 8 bits of a new bitfield. But then your next vector would need to shuffle differently.)

Much better to use 24 bytes of input, producing 32 bytes of output. Ideally with a load that splits down the middle, so the low 12 bytes are in the low 128-bit "lane", avoiding the need for a lane-crossing shuffle like `_mm256_permutexvar_epi32`. Instead you can just `_mm256_shuffle_epi8` to put bytes where you want them, setting up for some shift/and.
    // uses 24 bytes starting at p by doing a 32-byte load from p-4.
    // Don't use this for the first vector of a page-aligned array, or the last
    inline
    __m256i unpack12to16(const char *p)
    {
        __m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
           // v = [ x H G F E | D C B A x ]  where each letter is a 3-byte pair of two 12-bit fields, and x is 4 bytes of garbage we load but ignore
        const __m256i bytegrouping =
            _mm256_setr_epi8(4,5, 5,6, 7,8, 8,9, 10,11, 11,12, 13,14, 14,15, // low half uses last 12B
                             0,1, 1,2, 3,4, 4,5,  6, 7,  7, 8,  9,10, 10,11); // high half uses first 12B
        v = _mm256_shuffle_epi8(v, bytegrouping);
        // each 16-bit chunk has the bits it needs, but not in the right position

        // in each chunk of 8 nibbles (4 bytes): [ f e d c | d c b a ]
        __m256i hi = _mm256_srli_epi16(v, 4);                              // [ 0 f e d | xxxx ]
        __m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0x00000FFF));   // [ 0000 | 0 c b a ]

        return _mm256_blend_epi16(lo, hi, 0b10101010);
        // nibbles in each pair of epi16: [ 0 f e d | 0 c b a ]
    }

    // Untested: I *think* I got my shuffle and blend controls right, but didn't check.
It compiles like this (Godbolt) with clang `-O3 -march=znver2`. Of course an inlined version would load the vector constants once, outside a loop.
    unpack12to16(char const*):                # @unpack12to16(char const*)
            vmovdqu ymm0, ymmword ptr [rdi - 4]
            vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = ymm0[4,5,5,6,7,8,8,9,10,11,11,12,13,14,14,15,16,17,17,18,19,20,20,21,22,23,23,24,25,26,26,27]
            vpsrlw  ymm1, ymm0, 4
            vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
            vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
            ret
On Intel CPUs (before Ice Lake), `vpblendw` only runs on port 5 (https://uops.info/), competing with `vpshufb` (`..._shuffle_epi8`). But it's a single uop (unlike the `vpblendvb` variable-blend) with an immediate control. Still, that means a back-end ALU bottleneck of at best one vector per 2 cycles on Intel. If your src and dst are hot in L2 cache (or maybe only L1d), that might be the bottleneck, but this is already 5 uops for the front end, so with loop overhead and a store you're already close to a front-end bottleneck.
Blending with another `vpand` / `vpor` would cost more front-end uops but would mitigate the back-end bottleneck on Intel (before Ice Lake). It would be worse on AMD, where `vpblendw` can run on any of the 4 FP execution ports, and worse on Ice Lake, where `vpblendw` can run on p1 or p5. And like I said, cache load/store throughput might be a bigger bottleneck than port 5 anyway, so fewer front-end uops are definitely better, to let out-of-order exec see farther.
This may not be optimal; perhaps there's some way to set up for `vpunpcklwd` by getting the even (low) and odd (high) bit fields into the bottom 8 bytes of two separate input vectors even more cheaply? Or set up so we can blend with OR instead of needing to clear garbage in one input with `vpblendw`, which only runs on port 5 on Skylake?

Or something we can do with `vpsrlvd`? (But not `vpsrlvw` - that would require AVX-512.)
If you have AVX512VBMI, `vpmultishiftqb` is a parallel bitfield-extract. You'd just need to shuffle the right 3-byte pairs into the right 64-bit SIMD elements, then one `_mm256_multishift_epi64_epi8` to put the good bits where you want them, and a `_mm256_and_si256` to zero the high 4 bits of each 16-bit field. (You can't quite take care of everything with zero-masking, or by shuffling some zeros into the input for multishift, because there won't be any zeros contiguous with the low 12-bit field.) Or you could set up for just an `srli_epi16` that works for both low and high, instead of needing an AND constant, by having the multishift bitfield-extract line up both output fields with the bits you want at the top of the 16-bit element.
This may also allow a shuffle with larger granularity than bytes, although `vpermb` is actually fast on CPUs with AVX512VBMI; unfortunately, Ice Lake's `vpermw` is slower than `vpermb`.
With AVX-512 but not AVX512VBMI, working in 256-bit chunks lets us do the same thing as AVX2 but avoid the blend. Instead, use merge-masking for the right shift, or `vpsrlvw` with a control vector to shift only the odd elements. For 256-bit vectors, this is probably as good as `vpmultishiftqb`.