11

Having 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte.

Edit: to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1, and so on. Obviously all other bits within each byte are zero. The best I have at the moment is 2 PSHUFBs and a mask register for each position.

If the uint32_t is a bitmap, then the corresponding vector elements should be 0 or non-0 (i.e. so we could get a vector mask with a vpcmpeqb against an all-zero vector).
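
For reference, a minimal scalar version of the mapping I'm after (the function name and the choice of putting each bit in the low position of its byte are just for illustration):

#include <stdint.h>

// Scalar reference: byte i of out holds bit i of x, all other bits zero.
static void expand_bits_ref(uint32_t x, uint8_t out[32])
{
    for (int i = 0; i < 32; i++)
        out[i] = (x >> i) & 1;
}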

https://software.intel.com/en-us/forums/topic/283382

Peter Cordes
alecco
  • What language are you using? Is there some approach that you have tried that is too slow? – John Gietzen Jun 15 '14 at 02:10
  • C with Intel intrinsics. I tried the obvious approaches: broadcast the u32, then either variable shift or multiplication to shift each u32. But it starts to get complicated and would need several registers for mask. Then merge. I think I saw something similar a couple of years ago, in some assembly for video codecs or something. – alecco Jun 15 '14 at 02:15
  • 2
    Broadcast first. With AVX2 then use _mm256_and_si256. With AVX you need to split the lanes, do _mm_and_si128 twice, then join high and low. – Z boson Jun 15 '14 at 11:26
  • 1
    @alecco, I posted an answer to do this with AVX. It would be a bit simpler with AVX2. – Z boson Jun 16 '14 at 11:31
  • 1
    AVX512BW: `VPMOVM2B ymm1, k1`: sets each byte of `ymm1` to 0 or -1, according to the corresponding bit in `k1`. If the mask wasn't already in a mask register, then you also need a `KMOVD k1, k2/m32` or `KMOVD k1, r32`. Obviously you can do this with 64bit masks into 512b zmm registers, too. – Peter Cordes Feb 25 '16 at 01:55
  • AVX2 duplicate (with the same answer which looks optimal): https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb. Maybe leaving this open for the 128b AVX version. – Peter Cordes Nov 19 '17 at 06:32
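
Following up on the VPMOVM2B comment above: a minimal sketch of the AVX-512 route, assuming a target with AVX512BW and AVX512VL (the function name is mine; the intrinsics are the plain KMOVD / VPMOVM2B wrappers):

#include <immintrin.h>
#include <stdint.h>

// Sketch assuming AVX512BW + AVX512VL: KMOVD moves the bitmap into a mask
// register, then VPMOVM2B expands each mask bit to a 0 / -1 byte.
__m256i expand_bits_to_bytes_avx512(uint32_t x)
{
    __mmask32 k = _cvtu32_mask32(x);   // KMOVD k, r32
    return _mm256_movm_epi8(k);        // VPMOVM2B ymm, k
}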

1 Answer

16

To "broadcast" the 32 bits of a 32-bit integer x to 32 bytes of a 256-bit YMM register z or 16 bytes of a two 128-bit XMM registers z_low and z_high you can do the following.

With AVX2:

__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);

Without AVX2 it's best to do this with SSE:

__m128i y = _mm_set1_epi32(x);      
__m128i z_low  = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high); 
z_low  = _mm_and_si128(z_low ,mask2);
z_high = _mm_and_si128(z_high,mask2);

The masks and a working example are shown below. If you plan to do this several times you should probably define the masks outside of the main loop.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    uint32_t x = 0x87654321;   // doesn't fit in a signed int, so use an unsigned type

    static const char mask1a[32] = {
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x01, 0x01, 0x01, 0x01,
        0x01, 0x01, 0x01, 0x01,
        0x02, 0x02, 0x02, 0x02,
        0x02, 0x02, 0x02, 0x02,
        0x03, 0x03, 0x03, 0x03,
        0x03, 0x03, 0x03, 0x03
    };

    static const unsigned char mask2a[32] = {
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
    };

    unsigned char out[32];

#if defined ( __AVX2__ )
    __m256i mask2 = _mm256_loadu_si256((__m256i*)mask2a);
    __m256i mask1  = _mm256_loadu_si256((__m256i*)mask1a);

    __m256i y =    _mm256_set1_epi32(x);
    __m256i z =    _mm256_shuffle_epi8(y,mask1);
    z = _mm256_and_si256(z,mask2);

    _mm256_storeu_si256((__m256i*)out,z);

#else
    __m128i mask2 = _mm_loadu_si128((__m128i*)mask2a);
    __m128i mask_low  = _mm_loadu_si128((__m128i*)&mask1a[ 0]);
    __m128i mask_high = _mm_loadu_si128((__m128i*)&mask1a[16]);    

    __m128i y = _mm_set1_epi32(x); 
    __m128i z_low  = _mm_shuffle_epi8(y,mask_low);
    __m128i z_high = _mm_shuffle_epi8(y,mask_high);
    z_low  = _mm_and_si128(z_low,mask2);
    z_high = _mm_and_si128(z_high,mask2);

    _mm_storeu_si128((__m128i*)&out[ 0],z_low);
    _mm_storeu_si128((__m128i*)&out[16],z_high);
#endif
    for(int i=0; i<8; i++) {
        for(int j=0; j<4; j++) {
            printf("%x ", out[4*i+j]);
        }
        printf("\n");
    }
    printf("\n");
}

To get 0 or -1 in each vector element:

It takes one extra step: a _mm256_cmpeq_epi8 against all-zeros. That compare turns any non-zero byte into 0 and a zero byte into -1, i.e. it inverts the sense of the mask. Since we don't want that inversion, we use andnot instead of and when isolating the bits: andnot inverts its first operand, so the two inversions cancel and a set source bit ends up as an all-ones (-1) byte.

__m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);    // we only use the low 32bits of each lane, but this is fine with AVX2

    // Each byte gets the source byte containing the corresponding bit
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf  = _mm256_shuffle_epi8(xbcast, shufmask);

    __m256i andmask  = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);

    // this is the extra step: compare each byte == 0 to produce 0 or -1
    return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
     // alternative: use a plain `and` above and compare against the AND mask
     // itself to get 0 or -1, avoiding the need for a vector zero constant.
}

See it on the Godbolt Compiler Explorer.
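
The alternative hinted at in the comment at the end of that function would look roughly like this (a sketch of mine, not part of the original answer): isolate with a plain and, then compare against the same AND mask, so no all-zeros constant is needed.

__m256i expand_bits_to_bytes_alt(uint32_t x)
{
    __m256i xbcast   = _mm256_set1_epi32(x);
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf     = _mm256_shuffle_epi8(xbcast, shufmask);

    __m256i andmask  = _mm256_set1_epi64x(0x8040201008040201);
    __m256i isolated = _mm256_and_si256(shuf, andmask);

    // a byte equals its mask byte exactly when the corresponding source bit was set
    return _mm256_cmpeq_epi8(isolated, andmask);
}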

Also see "is there an inverse instruction to the movemask instruction in intel avx2?" for other element sizes.

Peter Cordes
Z boson
  • Could you give some idea on how you'd do it with AVX2? Thanks! – alecco Jun 16 '14 at 19:55
  • 1
    @alecco, I updated my answer showing how to do this with AVX2. I tested it as well. – Z boson Jun 17 '14 at 09:25
  • 1
    You're a star! Thanks a lot. Wish I could give you more upvotes. – alecco Jun 17 '14 at 14:00
  • 1
    If you want the resulting bytes to be 0 or -1 (so each bit of the mask expands to all bits of the vector byte), you need one more step. After the shuffle, use `andn` instead of `and` (inverting `y`). Then use a `_mm256_cmpeq_epi8` against a vector of all-zeros to invert again. – Peter Cordes May 26 '16 at 22:44
  • 2
    Also, I'd write mask2a as `_mm256_set1_epi64x(0x80'40'20'10'08'04'02'01)`. (The C++14 `'` separators for readability are totally optional.) To make it easy to select 128 vs. 256, you can use a `_mm_set1_epi64x()` and then the AVX2 version can use `_mm256_set_m128i(same,same)`. It all optimizes away at compile time. – Peter Cordes May 26 '16 at 22:58
  • For mask1, I'd highly recommend using `_mm_set` rather than a load. You absolutely don't want your function to compile into scalar immediate stores to the stack, and then a vector load! Using `_mm_set` for constants allows sharing of the constant between multiple uses, exactly like string literal merging (probably even using the same compiler logic). I don't see a nice way to write it more compactly, though, except with a CPP macro to repeat its argument 8 times. – Peter Cordes May 26 '16 at 23:00
  • @PeterCordes, that's why I said " If you plan to do this several times you should probably define the masks outside of the main loop." – Z boson May 27 '16 at 09:08
  • @PeterCordes, BTW, feel free to edit my answer with your suggestion but I prefer that you append your solution and not change what I have written. Just a horizontal marker and write after that if you like. – Z boson May 27 '16 at 09:32
  • @Zboson: I wasn't worried about it being hoisted or not, I was saying that it will waste instructions copying the data onto the stack, instead of just loading from a static constant. Look at the asm from your version converted to a function taking an `int` arg, but still printing: https://godbolt.org/g/HfYeMd. Note the `vmovdqa YMMWORD PTR [rbp-144], ymm0`. It's storing the constant it just loaded from `.LC0`. You can get rid of this with `static const char []`, but that still defeats constant-pool merging if this function was inlined into multiple files. **Just use `_mm_set`** – Peter Cordes May 27 '16 at 15:36
  • 1
    There's [a duplicate of this that has the same strategy](https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb). (Also suggesting an OR with a mask that has one bit *unset* and `vpcmpeqb` against `set1(0xFF)`, but an all-zeros vector is slightly cheaper than all-ones.) Not sure if I should close it. I was looking for a non-AVX version to link. I guess this has a non-AVX2 version at least. – Peter Cordes Nov 19 '17 at 06:30
  • Do you HAVE to do it with assembly/intrinsics, or can you do it portably? – MarcusJ Jan 03 '18 at 11:12
  • @PeterCordes: I think in `get 0 or -1` you can get away without `_mm256_setzero_si256` - by comparing with the `and` mask. https://gcc.godbolt.org/z/LBAz_a – Denis Yaroshevskiy May 08 '20 at 16:38
  • 1
    That's correct, that's a missed optimization here. According to https://stackoverflow.com/posts/36491672/revisions (my answer on a duplicate of this), I didn't think of that until revision #6 in 2017, long after my last edit to this question. – Peter Cordes May 08 '20 at 16:48