I have __m256i vectors that contain 10-bit words inside 16-bit integers (so 16*16 bits containing only 16*10 useful bits). What is the best/fastest way to extract only those 10 bits and pack them to produce an output bitstream of 10-bit values?
-
Here's the answer I was writing only to find out someone closed the question. https://gist.github.com/Const-me/1549e9540590862d5e0d1b558aeaada2 – Soonts Feb 07 '21 at 20:43
-
@0___________ The one you linked is different. That question is from 2014; AVX2 launched with Haswell in mid-2013 and was irrelevant back then because very few people had the hardware. Also, there are no good answers there: all of them are very slow, and that’s not how one should pack bits on modern computers. – Soonts Feb 07 '21 at 22:03
-
@Soonts: You could post a new answer on the old question, using newer technologies. We don't always need separate questions for SSE2 vs. SSSE3 vs. AVX2 versions of identical problems, especially when they're not super common. But if none of the answers there are good, yeah I'll reopen this one. Maybe when the dust settles (especially if you want to include an SSE/AVX1 answer), we can close the old question as a dup of this. – Peter Cordes Feb 08 '21 at 02:55
-
@Soonts: Did you check for other better duplicates? I wouldn't be surprised if this has been asked multiple times. I'll have a look now. e.g. [Efficiently packing 10-bit data on unaligned byte boundries](https://stackoverflow.com/q/34775546), but that's pure C++ :/ – Peter Cordes Feb 08 '21 at 02:58
-
@PeterCordes See my answer. I think it’s borderline impossible to produce similar code with automatic vectorizers, especially the `vpshufb` step. – Soonts Feb 08 '21 at 11:32
-
Just for the record, the original duplicate (with sub-optimal answers that don't use any shuffles) was [packing 10 bit values into a byte stream with SIMD](https://stackoverflow.com/q/23664015) – Peter Cordes Feb 08 '21 at 20:19
2 Answers
Here’s my attempt.
I haven't benchmarked it, but I think it should work pretty fast overall: not too many instructions, and all of them have 1 cycle of latency on modern processors. Also, the stores are efficient: 2 store instructions for 20 bytes of data.
The code only uses 3 constants. If you call this function in a loop, good compilers should load all three outside of the loop and keep them in registers.
// bitwise blend according to a mask
inline void combineHigh( __m256i& vec, __m256i high, const __m256i lowMask )
{
vec = _mm256_and_si256( vec, lowMask );
high = _mm256_andnot_si256( lowMask, high );
vec = _mm256_or_si256( vec, high );
}
// Store 10-bit pieces from each of the 16-bit lanes of the AVX2 vector.
// The function writes 20 bytes to the pointer.
inline void store_10x16_avx2( __m256i v, uint8_t* rdi )
{
// Pack pairs of 10 bits into 20, into 32-bit lanes
__m256i high = _mm256_srli_epi32( v, 16 - 10 );
const __m256i low10 = _mm256_set1_epi32( ( 1 << 10 ) - 1 ); // Bitmask of 10 lowest bits in 32-bit lanes
combineHigh( v, high, low10 );
// Now the vector contains 32-bit lanes with 20 payload bits / each
// Pack pairs of 20 bits into 40, into 64-bit lanes
high = _mm256_srli_epi64( v, 32 - 20 );
const __m256i low20 = _mm256_set1_epi64x( ( 1 << 20 ) - 1 ); // Bitmask of 20 lowest bits in 64-bit lanes
combineHigh( v, high, low20 );
// Now the vector contains 64-bit lanes with 40 payload bits / each
// 40 bits = 5 bytes, store initial 4 bytes of the result
_mm_storeu_si32( rdi, _mm256_castsi256_si128( v ) );
// Shuffle the remaining 16 bytes of payload into correct positions.
// The indices of the payload bytes are [ 0 .. 4 ] and [ 8 .. 12 ]
// _mm256_shuffle_epi8 can only move data within 16-byte lanes
const __m256i shuffleIndices = _mm256_setr_epi8(
// 6 remaining payload bytes from the lower half of the vector
4, 8, 9, 10, 11, 12,
// 10 bytes gap, will be zeros
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
// 6 bytes gap, will be zeros
-1, -1, -1, -1, -1, -1,
// 10 payload bytes from the higher half of the vector
0, 1, 2, 3, 4,
8, 9, 10, 11, 12
);
v = _mm256_shuffle_epi8( v, shuffleIndices );
// Combine and store the final 16 bytes of payload
const __m128i low16 = _mm256_castsi256_si128( v );
const __m128i high16 = _mm256_extracti128_si256( v, 1 );
const __m128i result = _mm_or_si128( low16, high16 );
_mm_storeu_si128( ( __m128i* )( rdi + 4 ), result );
}
This code truncates the unused higher 6 bits of the values.
If you want to saturate instead, you’ll need one more instruction, _mm256_min_epu16. Also, if you do that, the first step of the function can use pmaddwd. Here’s the complete function which saturates the source numbers, with a couple of extra adjustments.
// Store 10-bit pieces from 16-bit lanes of the AVX2 vector, with saturation.
// The function writes 20 bytes to the pointer.
inline void store_10x16_avx2( __m256i v, uint8_t* rdi )
{
const __m256i low10 = _mm256_set1_epi16( ( 1 << 10 ) - 1 );
#if 0
// Truncate higher 6 bits; pmaddwd won't truncate, it needs zeroes in the unused higher bits.
v = _mm256_and_si256( v, low10 );
#else
// Saturate numbers into the range instead of truncating
v = _mm256_min_epu16( v, low10 );
#endif
// Pack pairs of 10 bits into 20, into 32-bit lanes
// pmaddwd computes a[ 0 ] * b[ 0 ] + a[ 1 ] * b[ 1 ] for pairs of 16-bit lanes, making a single 32-bit number out of two pairs.
// Initializing multiplier with pairs of [ 1, 2^10 ] to implement bit shifts + packing
const __m256i multiplier = _mm256_set1_epi32( 1 | ( 1 << ( 10 + 16 ) ) );
v = _mm256_madd_epi16( v, multiplier );
// Now the vector contains 32-bit lanes with 20 payload bits / each
// Pack pairs of 20 bits into 40 in 64-bit lanes
__m256i low = _mm256_slli_epi32( v, 12 );
v = _mm256_blend_epi32( v, low, 0b01010101 );
v = _mm256_srli_epi64( v, 12 );
// Now the vector contains 64-bit lanes with 40 payload bits / each
// 40 bits = 5 bytes, store initial 4 bytes of the result
_mm_storeu_si32( rdi, _mm256_castsi256_si128( v ) );
// Shuffle the remaining 16 bytes of payload into correct positions.
const __m256i shuffleIndices = _mm256_setr_epi8(
// Lower half
4, 8, 9, 10, 11, 12,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
// Higher half
-1, -1, -1, -1, -1, -1,
0, 1, 2, 3, 4,
8, 9, 10, 11, 12
);
v = _mm256_shuffle_epi8( v, shuffleIndices );
// Combine and store the final 16 bytes of payload
const __m128i low16 = _mm256_castsi256_si128( v );
const __m128i high16 = _mm256_extracti128_si256( v, 1 );
const __m128i result = _mm_or_si128( low16, high16 );
_mm_storeu_si128( ( __m128i* )( rdi + 4 ), result );
}
This may be slightly faster or slower overall depending on the processor, compiler, and the code calling the function, but definitely helps with code size. No one cares about binary size anymore, but CPUs have limited L1I and µop caches.
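For reference, here's a minimal driver-loop sketch of the kind of caller this is meant for, assuming src holds count * 16 source values and dst has room for the full count * 20 output bytes (the function name and buffer layout are just for this illustration):
// Hypothetical caller: packs count blocks of 16 values into count * 20 output bytes.
inline void pack10_stream( const uint16_t* src, size_t count, uint8_t* dst )
{
    for( size_t i = 0; i < count; i++ )
    {
        const __m256i v = _mm256_loadu_si256( ( const __m256i* )( src + i * 16 ) );
        store_10x16_avx2( v, dst + i * 20 ); // 20 bytes of output per 32 bytes of input
    }
}
When the call is inlined, compilers should indeed hoist the constant vectors out of such a loop and keep them in registers.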
For completeness, here’s another version that uses SSE2 and optionally SSSE3 instead of AVX2; it's only slightly slower in practice.
// Compute v = ( v & lowMask ) | ( high & ( ~lowMask ) ), for 256 bits of data in two registers
inline void combineHigh( __m128i& v1, __m128i& v2, __m128i h1, __m128i h2, const __m128i lowMask )
{
v1 = _mm_and_si128( v1, lowMask );
v2 = _mm_and_si128( v2, lowMask );
h1 = _mm_andnot_si128( lowMask, h1 );
h2 = _mm_andnot_si128( lowMask, h2 );
v1 = _mm_or_si128( v1, h1 );
v2 = _mm_or_si128( v2, h2 );
}
inline void store_10x16_sse( __m128i v1, __m128i v2, uint8_t* rdi )
{
// Pack pairs of 10 bits into 20, in 32-bit lanes
__m128i h1 = _mm_srli_epi32( v1, 16 - 10 );
__m128i h2 = _mm_srli_epi32( v2, 16 - 10 );
const __m128i low10 = _mm_set1_epi32( ( 1 << 10 ) - 1 );
combineHigh( v1, v2, h1, h2, low10 );
// Pack pairs of 20 bits into 40, in 64-bit lanes
h1 = _mm_srli_epi64( v1, 32 - 20 );
h2 = _mm_srli_epi64( v2, 32 - 20 );
const __m128i low20 = _mm_set1_epi64x( ( 1 << 20 ) - 1 );
combineHigh( v1, v2, h1, h2, low20 );
#if 1
// 40 bits is 5 bytes, for the final shuffle we use pshufb instruction from SSSE3 set
// If you don't have SSSE3, below under `#else` there's SSE2-only workaround.
const __m128i shuffleIndices = _mm_setr_epi8(
0, 1, 2, 3, 4,
8, 9, 10, 11, 12,
-1, -1, -1, -1, -1, -1 );
v1 = _mm_shuffle_epi8( v1, shuffleIndices );
v2 = _mm_shuffle_epi8( v2, shuffleIndices );
#else
// SSE2-only version of the above, uses 8 instructions + 2 constants to emulate 2 instructions + 1 constant
// Need two constants because after this step we want zeros in the unused higher 6 bytes.
h1 = _mm_srli_si128( v1, 3 );
h2 = _mm_srli_si128( v2, 3 );
const __m128i low40 = _mm_setr_epi8( -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 );
const __m128i high40 = _mm_setr_epi8( 0, 0, 0, 0, 0, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0 );
const __m128i l1 = _mm_and_si128( v1, low40 );
const __m128i l2 = _mm_and_si128( v2, low40 );
h1 = _mm_and_si128( h1, high40 );
h2 = _mm_and_si128( h2, high40 );
v1 = _mm_or_si128( h1, l1 );
v2 = _mm_or_si128( h2, l2 );
#endif
// Now v1 and v2 vectors contain densely packed 10 bytes / each.
// Produce final result: 16 bytes in the low part, 4 bytes in the high part
__m128i low16 = _mm_or_si128( v1, _mm_slli_si128( v2, 10 ) );
__m128i high16 = _mm_srli_si128( v2, 6 );
// Store these 20 bytes with 2 instructions
_mm_storeu_si128( ( __m128i* )rdi, low16 );
_mm_storeu_si32( rdi + 16, high16 );
}

-
I think two overlapping 8-byte stores would save some shuffle work, at least if you're doing this in a loop and can thus write past the end of the 10-byte block. – Peter Cordes Feb 08 '21 at 20:24
-
@PeterCordes You probably mean two overlapping 16-byte stores? Indeed, if there’s a guarantee the destination is at least 26 bytes long, this saves 3 instructions. For the SSE2-only version it even saves one constant vector register: there's no need to zero out the unused higher 6 bytes, so `combineHigh` can be called a third time instead. – Soonts Feb 08 '21 at 22:45
-
Yes, right, 16-byte stores. Of course, LCM(20,16) is 80, so for very large arrays it might (or might not) be worth it to unroll enough to never need any overlap or narrow stores. OTOH with a shuffle bottleneck, overlapping 16-byte stores may be cheaper, especially `vmovdqu [mem], xmm` / `vextracti128 [mem+10], ymm, 1` for the AVX2 case. (So wouldn't that be saving *4* instructions? The 2 byte-shifts and OR, but also combining vextract with a store. On Intel, `vextracti128 m,r,i` doesn't micro-fuse, so it costs the same as 2 front-end instructions, but still no shuffle; But great on Zen) – Peter Cordes Feb 08 '21 at 23:09
-
Also, one OR+byteshift could be replaced with `palignr`, or the `[v]pshufb` could actually set up for a palignr in the first place by putting some of the data at the high end of the register. Or possibly even with stuff aligned to dword boundaries for `shufps` to grab parts of 2 regs? – Peter Cordes Feb 08 '21 at 23:11
-
IDK if there's a good way to set up for an efficient `movhps` to store the high half of a reg with a single uop, like maybe 16 + 8 overlapping by 4. But with 10 bytes from each 16-bit half, that would require some kind of blend or OR to set up data for the 2nd one. Still maybe cheaper to set up for than 2 overlapping 16-byte stores within a 20-byte chunk, for a version that doesn't write beyond the output. Still more expensive so not what you'd want in a loop. – Peter Cordes Feb 08 '21 at 23:25
-
Also worth mentioning that AVX-512 VBMI [`vpmultishiftqb`](https://www.felixcloutier.com/x86/vpmultishiftqb) may be usable for this. Although IIRC it can only grab contiguous (unaligned) 8-bit source chunks into aligned 8-bit destination spots. So you might need to use it twice with different control vectors to get both sides of gaps, and then bit-blend (vpternlogd), possibly with an AVX-512 variable-count `vpsllvw` in there somewhere to line things up. Lane-crossing AVX512VBMI `vpermb` solves all the `vpshufb` + fixup problems. – Peter Cordes Feb 08 '21 at 23:43
-
Another idea: the first packing step can use SSE2 `pmaddwd` with `1` and `1<<10` multiplier pairs to concatenate pairs of 10-bit numbers (assuming they're correctly zero-extended). With both constants left-shifted, in alternating pairs of pairs, it can pack towards the high end instead of low, making it contiguous within one qword! – Peter Cordes Feb 08 '21 at 23:51
-
@PeterCordes Good idea about palignr, updated. I have only changed AVX2 version however; for SSE it would consume another constant register, unlike AVX2 this may or may not be a good enough tradeoff. – Soonts Feb 08 '21 at 23:51
-
About AVX-512, feel free to edit and mention it. I have zero experience with it, and can't even test anything because I'm using an AMD Zen2 CPU. About `pmaddwd`, it might save a cycle on AMD, but on Intel it’s rather slow at 5 cycles latency, probably more than my current code. Also I think it breaks if negative int16 numbers are passed on input. – Soonts Feb 09 '21 at 00:09
-
Why would latency matter? The normal use-case is throughput bound, doing independent work for each 32-byte chunk of source data. PMADDWD should replace the first 7 instructions with 2 (pmaddwd + psrlq), and thus be only slightly more latency. Or 3 if you still need to AND away high garbage (to make unsigned). Also, if you ever want to play with AVX-512, Intel's SDE is easy to use, just `sde64 ./myprog`. Or test it on Godbolt which runs on SKX / Cascade-Lake. But in this case, I think `vpmaddwd` is better than `vpmultishiftqb`. (If I do make an AVX512 version, I'd post a separate answer.) – Peter Cordes Feb 09 '21 at 00:37
-
I posted an answer with some uops numbers. Looks like yours has 8 uops before the vpshufb; mine reduces that to 3 (or 2 if you don't need to AND away high garbage). After the vpshufb I also used the partial-overlap strategy, 3 front-end uops down from 5. But none of them are shuffles vs. yours having 3 shuffles there which could be a bottleneck when combined with vpshufb. So the write-past-end strategy is significant if doing only this in a loop. – Peter Cordes Feb 09 '21 at 02:00
-
I was able to eliminate one of the shuffles, only two are left, `vpshufb` and `vextracti128` – Soonts Feb 09 '21 at 15:52
-
More constants would be fine if compilers were consistently better about compacting them to load with a broadcast-load. I'd prefer `vpsllvd` over `vpslld` / `vpblendd` for a loop body, unless this was only going to run a few iterations occasionally. If you're loading one constant, loading others hopefully come from the same cache line. (Or pair of lines if compilers are dumb and load full 32-byte constants instead of a broadcast.) – Peter Cordes Feb 10 '21 at 21:44
-
@PeterCordes & Soonts thanks to both of you, I don't understand all the subtleties but I'm learning. Anyway, this function works great; here's how I integrated it: https://pastebin.com/UZubEYW1 (this code interleaves 10-bit pixels from uint16_t buffers, 2 Y for 1 U and 1 V). I got a factor of 1.8 compared to the C code automatically optimised by the compiler! – poypoy Feb 18 '21 at 14:01
-
@poypoy: As I suspected, you're looping over many pixels so you'd benefit from the overlapping-store technique in my answer. Even moreso with all the extra shuffles you do before calling this function; fewer shuffle uops in my version mean the shuffle-port bottleneck is less severe a problem (especially on Haswell/Skylake). I'm not sure if you can save any more shuffles on your input-setup side, e.g. with 128-bit load / insert (`vinserti128 ymm, [mem], 1`). If you can correct for the in-lane behaviour of `_mm256_unpacklo_epi16` by feeding it mixed input, instead of shuffling its output... – Peter Cordes Feb 18 '21 at 14:11
-
@poypoy I think you should use `_mm256_permute2x128_si256` instead of `_mm256_permute2f128_si256` in that code. They take the same arguments and return the same result, however `vperm2i128` is faster on many CPUs when the surrounding code uses integer instructions for these vectors. More info: https://stackoverflow.com/a/53673809/126995 – Soonts Feb 18 '21 at 14:21
-
@poypoy Also test both versions, the first one which uses bitwise blending, and the second one which does `pmaddwd`; switch the `#if 0` into `#if 1` if you want to truncate the higher bits of the values. – Soonts Feb 18 '21 at 14:23
-
Well, after testing the 3 functions (both yours and the one from @PeterCordes), it seems that the fastest is your second, but without truncating or saturating values; I just removed this step since the padding bits are already zeroes. I followed your advice to change _mm256_permute2f128_si256 to _mm256_permute2x128_si256; it seems it doesn't have any performance impact but I understand how it can be better. Strangely, the multithread scaling of the avx2 version is not as good as the C version (and 2 threads are less efficient than one), I think it's a bandwidth limitation, any ideas about this? – poypoy Feb 19 '21 at 14:37
-
@poypoy “the fastest is your second but without truncating or saturating values” I would expect them to be roughly the same. Can you confirm the compiler inlined all 4 calls to `store_10x16_avx2`? If it failed, the second version without truncation/saturation uses 2 less constant loads which would explain your results. If that’s the case, force the inlining. In clang or gcc I usually `#define __forceinline inline __attribute__((always_inline))`, in VC++ `__forceinline` is already a keyword. – Soonts Feb 21 '21 at 14:52
-
Sorry, it’s 1 less constant load, for `pmaddwd` and `vpshufb`, but still, when not inlined, that’s 2 versus 3 loads per call. The correct count, when inlined, is zero loads. – Soonts Feb 21 '21 at 15:04
-
After testing the complete processing 8*10000 times, I confirm that your second function is slightly faster than your first (1: mean 2.943ms, stddev 0.014; 2: mean 2.884ms, stddev 0.02). I'm using a Zen2 processor and gcc 9.3. Forcing inlining changed nothing, so I suppose it was already inlined correctly. – poypoy Feb 24 '21 at 13:19
-
@poypoy “I confirm that your second function is slightly faster than your first” Interesting. Here’s one more version: https://gist.github.com/Const-me/f6d045e51f7304f256f3963713fde554 It only uses a single vector constant, also all instructions there have 1 cycle of latency on Zen2. – Soonts Feb 25 '21 at 00:07
-
This version is about 6% slower. One clarification: since _mm_storeu_si32 is not defined in my compiler (gcc 9.3), I use a workaround with _mm_store_ss((float*)mem_addr, _mm_castsi128_ps(a)); can it have an impact on performance? – poypoy Feb 26 '21 at 15:22
-
@poypoy It can on some CPUs, but I don’t think Zen2 is one of them. The correct workaround is this: `#define _mm_storeu_si32(p, a) (void)(*(int*)(p) = _mm_cvtsi128_si32((a)))` (that code is from imminitrin.h in MSVC) – Soonts Feb 26 '21 at 20:15
In a loop, you may want to use partially-overlapping stores that write past the end of the 20 byte destination for each vector of source data. That saves the work of shuffling data across the 16-byte boundary to set up for 16 + 4 byte stores.
(@Soonts' updated answer with one vmovd and one vmovdqu store is very good and only has 2 total shuffle uops including vpshufb and vextracti128. When I initially wrote this, we hadn't yet thought of a good way to avoid storing outside the 20 bytes without spending more shuffle uops, which would create a bottleneck worse than the front-end. But vmovdqu + vextracti128 mem, ymm, 1 (2 uops, not micro-fused) is still slightly cheaper: 3 uops after the vpshufb instead of 4.)
Or unrolling could be good for large arrays, LCM(20,16) = 80, so with a large unroll (and different shuffle-control vectors for each position within that) you could be doing only aligned 16-byte stores. But that might take a lot of shuffling, including between source chunks, probably with palignr.
Example of two overlapping 16-byte stores
Use this as a loop body where overwriting past 20 bytes is ok.
#include <immintrin.h>
#include <stdint.h>
// Store 10-bit pieces from each of the 16-bit lanes of the AVX2 vector.
// The function writes 20 useful bytes to the pointer
// but actually steps on data out to 26 bytes from dst
void pack10bit_avx2_store26( __m256i v, uint8_t* dst)
{
// clear high garbage if elements aren't already zero-extended
//v = _mm256_and_si256(v, _mm256_set1_epi16( (1<<10)-1) );
... prep data somehow; pmaddwd + a couple shifts is good for throughput
// Now the vector contains 64-bit lanes with 40 payload bits / each; 40 bits = 5 bytes.
// Shuffle these bytes into a very special order.
// Note _mm256_shuffle_epi8 can only move data within 16-byte lanes.
const __m256i shuffleIndices = _mm256_setr_epi8(
// 6 bytes gap with zeros
// Pack the two 5-byte chunks into the bottom of each 16-byte lane
0, 1, 2, 3, 4,
8, 9, 10, 11, 12,
-1, -1, -1, -1, -1, -1,
0, 1, 2, 3, 4,
8, 9, 10, 11, 12,
-1, -1, -1, -1, -1, -1);
v = _mm256_shuffle_epi8(v, shuffleIndices );
// Split the vector into halves
__m128i low16 = _mm256_castsi256_si128( v );
_mm_storeu_si128( ( __m128i* )dst, low16 ); // vmovdqu mem, xmm
__m128i high16 = _mm256_extracti128_si256( v, 1 );
_mm_storeu_si128( ( __m128i* )(dst+10), high16 ); // vextracti128 mem, ymm, 1
// An AVX-512 masked store could avoid writing past the end
}
We can see how it might inline into a loop by compiling it to a stand-alone function (https://godbolt.org/z/8T7KhT).
# clang -O3 -march=skylake
pack10bit_avx2(long long __vector(4), unsigned char*):
# vpand commented out
vpmaddwd ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
... # work in progress, original PMADDWD idea ignored some limitations! See Soonts' answer
vpshufb ymm0, ymm0, ymmword ptr [rip + .LCPI0_1] # ymm0 = ymm0[0,1,2,3,4,8,9,10,11,12],zero,zero,zero,zero,zero,zero,ymm0[16,17,18,19,20,24,25,26,27,28],zero,zero,zero,zero,zero,zero
vmovdqu xmmword ptr [rdi], xmm0
vextracti128 xmmword ptr [rdi + 10], ymm0, 1
vzeroupper # overhead that goes away when inlining into a loop
ret
In a loop, compilers would load those 2 vector constants into registers, hopefully using broadcast-loads.
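As a sketch of that loop (my own illustration, not part of the compiled example): each call writes 26 bytes for 20 bytes of payload, so the caller has to guarantee at least 6 bytes of slack after the last group, e.g. by over-allocating the destination buffer or handling the final chunk separately.
// Hypothetical caller for the write-past-the-end version.
// dst must have at least count * 20 + 6 writable bytes.
void pack10bit_avx2_loop( const __m256i* src, size_t count, uint8_t* dst )
{
    for( size_t i = 0; i < count; i++ )
        pack10bit_avx2_store26( src[i], dst + i * 20 ); // each call may touch 26 bytes
}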
Unlike some wider integer multiplies or horizontal add, vpmaddwd is handled efficiently, as a single uop with 5 cycle latency. https://uops.info/
The vextracti128 store can't micro-fuse on Intel, but unlike vpextrd there's no shuffle uop involved. Just store-address and store-data. Zen2 also runs it as 2 uops, with throughput of one per 2 cycles unfortunately. (Worse than Zen1.)
Before Ice Lake, both Intel and AMD can run 1 store per clock.
If you do actually want the packed data back in registers, you might want @Soonts' original shuffles using palignr, or you could do a block of this and then some reloads. Latency would be higher (especially because of store-forwarding stalls on the reloads), but if your block is several registers worth of data then that should overlap or even hide the latency, maybe giving the stores time to even commit to L1d and not cause a stall when reloaded.
BMI2 pext
uint64_t packed = _pext_u64(x, 0x03FF03FF03FF03FF);
Maybe good for scalar cleanup or a short chunk of 4 pixels or whatever. This leaves you with the problem of doing a 5-byte store (or an 8-byte store with trailing zeros). Beware of strict-aliasing and alignment if using this, e.g. use memcpy to get unaligned may-alias data into a uint64_t, or make an __attribute__((aligned(1),may_alias)) typedef.
pext is very efficient on Intel (1 uop, 3c latency), but very bad on AMD, much worse than just using the low part of one SIMD step.
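For example, a scalar helper along these lines (the memcpy-based load/store and the function name are my own sketch, assuming a little-endian target with BMI2):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
// Pack 4 source values (16-bit integers, 10 useful bits each) into 5 output bytes.
// pext only gathers the bits selected by the mask, so high garbage is ignored for free.
inline void pack10x4_scalar( const uint16_t* src, uint8_t* dst )
{
    uint64_t x;
    memcpy( &x, src, sizeof( x ) );                                // unaligned, aliasing-safe 8-byte load
    const uint64_t packed = _pext_u64( x, 0x03FF03FF03FF03FFull ); // 40 payload bits at the bottom
    memcpy( dst, &packed, 5 );                                     // 5-byte store of the low bytes
}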
AVX-512
AVX512VBMI (Ice Lake) would give you vpermb (lane crossing) instead of vpshufb. (AVX512BW for vpermw on Skylake-X / Cascade Lake would require you to have already combined into an even number of bytes, and it's 2 uops even on Ice Lake where vpermb is 1, so it's pretty bad.) vpermb could set up for a single unaligned 32-byte store (with 20 useful bytes), which you overlap in a loop.
AVX-512 stores can be efficiently masked to not actually overwrite past the end, e.g. using dword masking. vmovdqu32 [rdi]{k}, ymm0 is 1 uop on Skylake-X. But AVX2 vmaskmovd is a few uops even on Intel, and extremely expensive on AMD, so you don't want to do that. And dword masking only works if you have all 20 bytes ready for one store, otherwise you need at least 16-bit granularity.
Other AVX-512 instructions: VBMI vpmultishiftqb, a parallel bitfield extract, seems like it might be useful, but it can only write aligned 8-bit destination chunks from unaligned but contiguous source chunks. I don't think that's better than what we can do with variable-shifts and rotates. vpmultishiftqb would let us unpack this format (inverse of this function) in probably 2 instructions: 1 shuffle (such as vpexpandb or vpermb) to put the needed data into each qword in the vector, and one multishift to grab the right 10-bit field for the bottom of each word.
AVX-512 has variable-count shifts and rotates, including with word (16-bit) granularity, so that would be an option instead of vpmaddwd for the first step. Using shifts ignores high garbage for free. It has lower latency, and merge-masking for the immediate version can replace the need for a control vector. (But then you need a mask constant.)
With masking the latency is 3 cycles, vs 1 without, and AVX-512 makes it about as efficient to broadcast a control vector from an immediate as to mov reg,imm / kmov kreg, reg, e.g. mov reg,imm / vpbroadcastd ymm, reg (1 uop). Merge-masking also constrains the optimizer to overwrite the destination register instead of copy-and-shift, although that shouldn't matter here if the optimizer is smart. Neither way lets the load of the data fold into a memory source operand for the shift: sllvw can only take the counts from memory, and sllw needs to merge into the original in a register.
Shifts can run on ports 0 or 1 on Intel (and AMD doesn't support AVX-512). Or only port 0 for 512-bit uops, shutting down port 1 for any vector-ALU uop while any 512-bit uops are in flight. So there's a potential throughput bottleneck on port 0 for a __m512i version of this, but for 256-bit there are enough other uops (shuffle and store, and presumably loop overhead if doing this for an array of data) that this should be fairly evenly distributed.
This shift part (before _mm256_permutexvar_epi8) only requires AVX-512BW (+VL), and will work on Skylake-X. It leaves the data in the same place as other methods, so is a drop-in replacement you can mix and match with various strategies.
// Ice Lake. Could work on __m512i but then shifts could only run on p0, not p0/p1,
// and almost every store would be a cache line split.
inline void store_10x16_avx512vbmi( __m256i v, uint8_t* dst )
{
// no _mm256_and_si256 needed, we safely ignore high bits
// v = [ ?(6) ... B[9:0] | ?(6) ... A[9:0] ] repeated
v = _mm256_sllv_epi16(v, _mm256_set1_epi32((0<<16) | 6)); // alternative: simple repeated-pattern control vector
// v = _mm256_mask_slli_epi16(v, 0x5555, v, 6); // merge-masking, updating only elements 0,2, etc.
// v = [ ?(6) ... B[9:0] | A[9:0] ... 0(6) ] repeated
v = _mm256_rolv_epi32(v, _mm256_set1_epi64x(((32ULL-6)<<32) | 6)); // top half right, bottom half left
// v = [ 0(6) .. ?(6) .. D[9:0] | C[9:0] | B[9:0] | A[9:0] ... 0(12) ] repeated
v = _mm256_srli_epi64(v, 12); // 40 bit chunks at the bottom of each qword
const __m256i permb = _mm256_setr_epi8( 0, 1, 2, 3, 4, 8, 9,10,11,12,
16,17,18,19,20, 24,25,26,27,28,
28,28,28,28,28,28,28,28,28,28,28,28 );
// repeat last byte as filler. vpermb can't zero (except by maskz) but we can do a masked store
v = _mm256_permutexvar_epi8(permb, v); // AVX512_VBMI vpermb; note this intrinsic takes the index vector as its first argument
_mm256_mask_storeu_epi32( dst, 0x1F, v); // 32-bit masking granularity in case that's cheaper for HW. 20 bytes = 5 dwords.
}
Compiles like so (Godbolt):
# clang -O3 -march=icelake-client. GCC is essentially the same.
store_10x16_avx512vbmi(long long __vector(4), unsigned char*):
vpsllvw ymm0, ymm0, ymmword ptr [rip + .LCPI0_0]
vprolvd ymm0, ymm0, ymmword ptr [rip + .LCPI0_1]
vpsrlq ymm0, ymm0, 12
vpermb ymm0, ymm0, ymmword ptr [rip + .LCPI0_2]
mov al, 31 # what the heck, clang? partial register false dependency for no reason!
kmovd k1, eax
vmovdqu32 ymmword ptr [rdi] {k1}, ymm0
# vzeroupper not needed because the caller was using __m256i args. GCC omits it.
ret
Even if you use the same shift constant vector twice to make the compiler keep it around in a register (instead of using it directly from a memory source operand), it still chooses to load it from memory instead of mov eax,6 / vpbroadcastd ymm1, eax or something. This saves 1 uop at the cost of needing the constant in .rodata. To be fair, we do need other constants, probably in the same cache line, but the way GCC wastes space they don't all fit in one cache line! clang notices the pattern and uses a vpbroadcastd or q load; gcc wastefully loads a full 32 bytes. (kmov k1, [mem] is 3 front-end uops, so it wouldn't save a uop to load mask constants from memory.)
Using _mm256_mask_slli_epi16(v, 0x5555, v, 6), clang optimizes it back into vpsllvw ymm0, ymm0, ymmword ptr [rip + .LCPI0_0] with the same 6,0 repeating constant. So I guess that's a good sign I got it right. But GCC compiles as written:
store_10x16_avx512vbmi(long long __vector(4), unsigned char*):
mov eax, 21845
kmovw k1, eax
vpsllw ymm0{k1}, ymm0, 6
vprolvd ymm0, ymm0, YMMWORD PTR .LC0[rip]
mov eax, 31
kmovb k2, eax
vpsrlq ymm0, ymm0, 12
vpermb ymm0, ymm0, YMMWORD PTR .LC1[rip]
vmovdqu32 YMMWORD PTR [rdi]{k2}, ymm0
ret
_mm256_sllv_epi16 requires AVX-512BW and AVX-512VL. rolv_epi32 only requires AVX-512VL. (Or just AVX-512F for the 512-bit version.) Rotates only come in 32 and 64 element sizes, not 16, but AVX-512 does extend variable-shift granularity down to 16 (from 32 or 64 in AVX2).
vpcompressb [rdi]{k1}, ymm0 (AVX512VBMI2 = Ice Lake and later) would be an alternative to vpermb + store to pack bytes at the bottom of a register (like BMI2 pext but for vector elements instead of bits in a scalar register). But it's actually more expensive: 6 uops on Ice Lake, with one per 6c throughput. (vpcompressd is not as bad.) Even vpcompressb into a vector register is 2 uops, so for a constant shuffle-control it's better to load a vector constant for vpermb, unless cache misses for control vectors are a problem, e.g. if you're only doing this once every so often then let the HW process a k mask instead of a load.
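If you did want to try it, a rough sketch of the usage (my guess, assuming the same 40-bits-per-qword layout produced by the earlier steps, and AVX512VBMI2 + VL):
// Compress the payload bytes (0-4, 8-12, 16-20 and 24-28 of the vector)
// into 20 contiguous bytes at dst with a single vpcompressb store.
inline void store20_vpcompressb( uint8_t* dst, __m256i v )
{
    _mm256_mask_compressstoreu_epi8( dst, 0x1F1F1F1Fu, v );
}
One instruction in the source, but as noted above it costs more back-end uops than vpermb + a masked store.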
AVX-512 without VBMI: 2x 16-byte stores without exceeding the 20-byte range
... // same setup as usual, leaving 40-bit chunks at the bottom of each qword
const __m256i shuffleIndices = _mm256_setr_epi8(
// 6 bytes gap with zeros
// Pack the two 5-byte chunks into the bottom of each 16-byte lane
0, 1, 2, 3, 4,
8, 9, 10, 11, 12,
-1, -1, -1, -1, -1, -1,
0, 1, 2, 3, 4,
8, 9, 10, 11, 12,
-1, -1, -1, -1, -1, -1);
v = _mm256_shuffle_epi8(v, shuffleIndices );
// Split the vector into halves
__m128i low16 = _mm256_castsi256_si128( v );
_mm_storeu_si128( ( __m128i* )dst, low16 ); // vmovdqu mem, xmm no masking
// An AVX-512BW masked store avoiding writing past the end costs more instructions (and back-end uops), same front-end uops
__m128i high16 = _mm256_extracti128_si256( v, 1 ); // vextracti128 xmm, ymm, 1
_mm_mask_storeu_epi8( dst+10, 0x3FF, high16 ); // vmovdqu8 [mem]{k}, xmm
This needs vextracti128 xmm, ymm, 1 to set up for vmovdqu8. Unlike with writing 26 bytes, we can't extract directly to memory. There is no vextracti8x16, only vextracti32x4 and 64x2 (and 32x8 / 64x4 256-bit extracts). We need byte-granularity masking but can't get it with an instruction that extracts directly to memory, only via a shuffle (vextract into a register) and then vmovdqu8.
So the asm we get is
# clang
... vpshufb result in YMM0
vmovdqu [rdi], xmm0 # same as before
vextracti128 xmm0, ymm0, 1 # 1 shuffle uop
mov ax, 1023
kmovd k1, eax # will be hoisted
vmovdqu8 [rdi + 10] {k1}, xmm0 # 1 micro-fused uop
Since vextracti128 [mem], ymm, 1 was 2 front-end uops anyway, this doesn't hurt front-end throughput. (It does create more pressure on back-end execution ports, thanks to the shuffle uop.)

-
Your code doesn’t work. The maximum left shift you can do with `pmaddwd` is by 14 bits (because 0x8000 is a negative number). To pack values into the high positions you’re trying to shift by 22 bits. `pmaddwd` may only help with the first reduction step, which merges pairs of 10-bit values into 20 bits in 32-bit lanes. It works for that first step, but I don’t believe that’s faster than my 4 instructions psrld/pand/pandn/por, at least not on Intel. – Soonts Feb 09 '21 at 11:22
-
@Soonts: oh damn, you're right. At least with AVX2 `vpsllvd` we can efficiently left-shift alternate dwords to set up for a qword right-shift. One `pmaddwd` is still better for throughput than 4 boolean/shift instructions, especially on Intel where `pmaddwd` can run on either of 2 different ports! I don't understand your focus on latency here; most SIMD code is executed in a loop body that's not part of a loop-carried dependency, including this particular code where the OP says they have multiple `__m256i` vectors to process. Front-end and/or shuffle-port will be the bottleneck. – Peter Cordes Feb 09 '21 at 11:40
-
It’s not just the latency, `pmaddwd` needs inputs to be truncated or saturated, which uses extra instruction + constant. Automatic truncation is what people using this function are likely to expect, garbage on output is unexpected. Same applies to writing 26 bytes to memory for 20 bytes of data. https://en.wikipedia.org/wiki/Robustness_principle – Soonts Feb 09 '21 at 12:59
-
@Soonts: ok, so 2 uops instead of 1 then, if we need to zero out high garbage before `pmaddwd`. Still better than 4. (I think you're right that the OP may be implying there's high garbage on input, but being able to save a uop if there isn't is nice.) Re: writing 26 bytes: that's why you can gain efficiency when you think in terms of a loop body instead of a single iteration, and in a real codebase you'd definitely put this inside a loop, not pretending to be a good stand-alone single-vector implementation. – Peter Cordes Feb 09 '21 at 21:26
-
(Although interesting idea to reduce shuffles: `movd` is cheaper than `pextrd` so I'd considered suggesting looking for that anyway, but hadn't realized just how good it could be; this nicely avoids a lot of shuffle uops. 2 cycle throughput from both stores and shuffle bottlenecks on Intel pre-ICL and also Zen, if the whole thing part can get through the front-end fast enough and there aren't other bottlenecks. e.g. `vpsllvd` is multi-uop on Haswell, but single-uop on Zen (well, per lane on zen1). Zen1/2 has 3 cycle latency for variable-shift, and Zen2 only runs it on the FP2 port.) – Peter Cordes Feb 09 '21 at 21:34
-
@Soonts: Thanks again for catching that overly optimistic use of pmaddwd. Updated this answer to just focus on the overlapping-store idea, and on AVX-512. `sllv_epi16` and `rolv_epi32` (rotate) are interesting, avoiding need to any clearing of garbage. – Peter Cordes Feb 10 '21 at 05:30
-
Good point on `vpsllvd`, reworked to use 2 faster instructions instead, `vpslld` and `vpblendd`. The same approach can be used for the first step too; this eliminates the need for the constant there, however it might become too many shifts then. – Soonts Feb 10 '21 at 14:29