I'm assuming you're doing this in a loop over a large array. If you only used `__m128i` loads, you'd have 15 useful bytes, which would only produce 20 output bytes in your `__m256i` output. (Well, I guess the 21st byte of output would be present, since the 16th byte of the input vector holds the first 8 bits of a new bitfield. But then your next vector would need to shuffle differently.)

Much better to use 24 bytes of input, producing 32 bytes of output. Ideally with a load that splits down the middle, so the low 12 bytes are in the low 128-bit "lane", avoiding the need for a lane-crossing shuffle like `_mm256_permutexvar_epi32`. Instead you can just `_mm256_shuffle_epi8` to put bytes where you want them, setting up for some shift/and.
    // uses 24 bytes starting at p by doing a 32-byte load from p-4.
    // Don't use this for the first vector of a page-aligned array, or the last
    inline
    __m256i unpack12to16(const char *p)
    {
        __m256i v = _mm256_loadu_si256( (const __m256i*)(p-4) );
           // v = [ x H G F E | D C B A x ]  where each letter is a 3-byte pair of two 12-bit fields, and x is 4 bytes of garbage we load but ignore
        const __m256i bytegrouping =
            _mm256_setr_epi8(4,5, 5,6, 7,8, 8,9, 10,11, 11,12, 13,14, 14,15, // low half uses last 12B
                             0,1, 1,2, 3,4, 4,5,  6, 7,  7, 8,  9,10, 10,11); // high half uses first 12B
        v = _mm256_shuffle_epi8(v, bytegrouping);
        // each 16-bit chunk has the bits it needs, but not in the right position

        // in each chunk of 8 nibbles (4 bytes): [ f e d c | d c b a ]
        __m256i hi = _mm256_srli_epi16(v, 4);                              // [ 0 f e d | xxxx ]
        __m256i lo = _mm256_and_si256(v, _mm256_set1_epi32(0x00000FFF));   // [ 0000 | 0 c b a ]

        return _mm256_blend_epi16(lo, hi, 0b10101010);
        // nibbles in each pair of epi16: [ 0 f e d | 0 c b a ]
    }

    // Untested: I *think* I got my shuffle and blend controls right, but didn't check.
It compiles like this (Godbolt) with clang `-O3 -march=znver2`. Of course an inlined version would load the vector constants once, outside a loop.
    unpack12to16(char const*):                # @unpack12to16(char const*)
            vmovdqu ymm0, ymmword ptr [rdi - 4]
            vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] # ymm0 = ymm0[4,5,5,6,7,8,8,9,10,11,11,12,13,14,14,15,16,17,17,18,19,20,20,21,22,23,23,24,25,26,26,27]
            vpsrlw  ymm1, ymm0, 4
            vpand   ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
            vpblendw        ymm0, ymm0, ymm1, 170           # ymm0 = ymm0[0],ymm1[1],ymm0[2],ymm1[3],ymm0[4],ymm1[5],ymm0[6],ymm1[7],ymm0[8],ymm1[9],ymm0[10],ymm1[11],ymm0[12],ymm1[13],ymm0[14],ymm1[15]
            ret
On Intel CPUs (before Ice Lake), `vpblendw` only runs on port 5 (https://uops.info/), competing with `vpshufb` (`..._shuffle_epi8`). But it's a single uop (unlike the `vpblendvb` variable-blend) with an immediate control. Still, that means a back-end ALU bottleneck of at best one vector per 2 cycles on Intel. If your src and dst are hot in L2 cache (or maybe only L1d), that might be the bottleneck, but this is already 5 uops for the front end, so with loop overhead and a store you're already close to a front-end bottleneck.
Blending with another `vpand` / `vpor` would cost more front-end uops but would mitigate the back-end bottleneck on Intel (before Ice Lake). It would be worse on AMD, where `vpblendw` can run on any of the 4 FP execution ports, and worse on Ice Lake, where `vpblendw` can run on p1 or p5. And like I said, cache load/store throughput might be a bigger bottleneck than port 5 anyway, so fewer front-end uops are definitely better, to let out-of-order exec see farther.
This may not be optimal; perhaps there's some way to set up for `vpunpcklwd` by getting the even (low) and odd (high) bit fields into the bottom 8 bytes of two separate input vectors even more cheaply? Or set up so we can blend with OR instead of needing to clear garbage in one input with `vpblendw`, which only runs on port 5 on Skylake?

Or something we can do with `vpsrlvd`? (But not `vpsrlvw` - that would require AVX-512.)
If you have AVX512VBMI, `vpmultishiftqb` is a parallel bitfield-extract. You'd just need to shuffle the right 3-byte pairs into the right 64-bit SIMD elements, then one `_mm256_multishift_epi64_epi8` to put the good bits where you want them, and a `_mm256_and_si256` to zero the high 4 bits of each 16-bit field. (You can't quite take care of everything with zero-masking, or by shuffling some zeros into the input for multishift, because there won't be any zeros contiguous with the low 12-bit field.) Or you could set up for just an `srli_epi16` that works for both low and high, instead of needing an AND constant, by having the multishift bitfield-extract line up both output fields with the bits you want at the top of the 16-bit element.
This may also allow a shuffle with larger granularity than bytes, although `vpermb` is actually fast on CPUs with AVX512VBMI; unfortunately, Ice Lake's `vpermw` is slower than `vpermb`.
With AVX-512 but not AVX512VBMI, working in 256-bit chunks lets us do the same thing as AVX2 but avoid the blend. Instead, use merge-masking for the right shift, or `vpsrlvw` with a control vector to shift only the odd elements. For 256-bit vectors, this is probably as good as `vpmultishiftqb`.