11

Having 32 bits stored in a uint32_t in memory, what's the fastest way to unpack each bit to a separate byte element of an AVX register? The bits can be in any position within their respective byte.

Edit: to clarify, I mean bit 0 goes to byte 0, bit 1 to byte 1, and so on. Obviously all other bits within each byte are zero. The best I have at the moment is 2 PSHUFBs and a mask register for each position.

If the uint32_t is a bitmap, then the corresponding vector elements should be 0 or non-0 (i.e. so we could get a vector mask with a vpcmpeqb against an all-zero vector).
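
For reference, a minimal scalar version of the mapping I'm after (the function name and the choice of putting each bit in the low position of its byte are just for illustration):

#include <stdint.h>

// Scalar reference: byte i of out holds bit i of x, all other bits zero.
static void expand_bits_ref(uint32_t x, uint8_t out[32])
{
    for (int i = 0; i < 32; i++)
        out[i] = (x >> i) & 1;
}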

https://software.intel.com/en-us/forums/topic/283382

Peter Cordes
alecco
  • What language are you using? Is there some approach that you have tried that is too slow? – John Gietzen Jun 15 '14 at 02:10
  • C with Intel intrinsics. I tried the obvious approaches: broadcast the u32, then either variable shift or multiplication to shift each u32. But it starts to get complicated and would need several registers for mask. Then merge. I think I saw something similar a couple of years ago, in some assembly for video codecs or something. – alecco Jun 15 '14 at 02:15
  • 2
    Broadcast first. With AVX2 then use _mm256_and_si256. With AVX you need to split the lanes, do _mm_and_si128 twice, then join high and low. – Z boson Jun 15 '14 at 11:26
  • 1
    @alecco, I posted an answer to do this with AVX. It would be a bit simpler with AVX2. – Z boson Jun 16 '14 at 11:31
  • 1
    AVX512BW: `VPMOVM2B ymm1, k1`: sets each byte of `ymm1` to 0 or -1, according to the corresponding bit in `k1`. If the mask wasn't already in a mask register, then you also need a `KMOVD k1, k2/m32` or `KMOVD k1, r32`. Obviously you can do this with 64bit masks into 512b zmm registers, too. – Peter Cordes Feb 25 '16 at 01:55
  • AVX2 duplicate (with the same answer which looks optimal): https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb. Maybe leaving this open for the 128b AVX version. – Peter Cordes Nov 19 '17 at 06:32
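
Following up on the VPMOVM2B comment above: a minimal sketch of the AVX-512 route, assuming a target with AVX512BW and AVX512VL (the function name is mine; the intrinsics are the plain KMOVD / VPMOVM2B wrappers):

#include <immintrin.h>
#include <stdint.h>

// Sketch assuming AVX512BW + AVX512VL: KMOVD moves the bitmap into a mask
// register, then VPMOVM2B expands each mask bit to a 0 / -1 byte.
__m256i expand_bits_to_bytes_avx512(uint32_t x)
{
    __mmask32 k = _cvtu32_mask32(x);   // KMOVD k, r32
    return _mm256_movm_epi8(k);        // VPMOVM2B ymm, k
}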

1 Answer

16

To "broadcast" the 32 bits of a 32-bit integer x to 32 bytes of a 256-bit YMM register z or 16 bytes of a two 128-bit XMM registers z_low and z_high you can do the following.

With AVX2:

__m256i y = _mm256_set1_epi32(x);
__m256i z = _mm256_shuffle_epi8(y,mask1);
z = _mm256_and_si256(z,mask2);

Without AVX2 it's best to do this with SSE:

__m128i y = _mm_set1_epi32(x);      
__m128i z_low  = _mm_shuffle_epi8(y,mask_low);
__m128i z_high = _mm_shuffle_epi8(y,mask_high); 
z_low  = _mm_and_si128(z_low ,mask2);
z_high = _mm_and_si128(z_high,mask2);

The masks and a working example are shown below. If you plan to do this several times you should probably define the masks outside of the main loop.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main() {
    uint32_t x = 0x87654321;   // doesn't fit in a signed int, so use an unsigned type

    static const char mask1a[32] = {
        0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00,
        0x01, 0x01, 0x01, 0x01,
        0x01, 0x01, 0x01, 0x01,
        0x02, 0x02, 0x02, 0x02,
        0x02, 0x02, 0x02, 0x02,
        0x03, 0x03, 0x03, 0x03,
        0x03, 0x03, 0x03, 0x03
    };

    static const unsigned char mask2a[32] = {
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08,
        0x10, 0x20, 0x40, 0x80,
    };

    unsigned char out[32];

#if defined ( __AVX2__ )
    __m256i mask2 = _mm256_loadu_si256((__m256i*)mask2a);
    __m256i mask1  = _mm256_loadu_si256((__m256i*)mask1a);

    __m256i y =    _mm256_set1_epi32(x);
    __m256i z =    _mm256_shuffle_epi8(y,mask1);
    z = _mm256_and_si256(z,mask2);

    _mm256_storeu_si256((__m256i*)out,z);

#else
    __m128i mask2 = _mm_loadu_si128((__m128i*)mask2a);
    __m128i mask_low  = _mm_loadu_si128((__m128i*)&mask1a[ 0]);
    __m128i mask_high = _mm_loadu_si128((__m128i*)&mask1a[16]);    

    __m128i y = _mm_set1_epi32(x); 
    __m128i z_low  = _mm_shuffle_epi8(y,mask_low);
    __m128i z_high = _mm_shuffle_epi8(y,mask_high);
    z_low  = _mm_and_si128(z_low,mask2);
    z_high = _mm_and_si128(z_high,mask2);

    _mm_storeu_si128((__m128i*)&out[ 0],z_low);
    _mm_storeu_si128((__m128i*)&out[16],z_high);
#endif
    for(int i=0; i<8; i++) {
        for(int j=0; j<4; j++) {
            printf("%x ", out[4*i+j]);
        }
        printf("\n");
    }
    printf("\n");
}

To get 0 or -1 in each vector element:

It takes one extra step: a _mm256_cmpeq_epi8 against all-zeros. That compare turns any non-zero byte into 0 and a zero byte into -1, i.e. it inverts the sense of the mask. Since we don't want that inversion, we use andnot instead of and when isolating the bits: andnot inverts its first operand, so the two inversions cancel and a set source bit ends up as an all-ones (-1) byte.

__m256i expand_bits_to_bytes(uint32_t x)
{
    __m256i xbcast = _mm256_set1_epi32(x);    // we only use the low 32bits of each lane, but this is fine with AVX2

    // Each byte gets the source byte containing the corresponding bit
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf  = _mm256_shuffle_epi8(xbcast, shufmask);

    __m256i andmask  = _mm256_set1_epi64x(0x8040201008040201);  // every 8 bits -> 8 bytes, pattern repeats.
    __m256i isolated_inverted = _mm256_andnot_si256(shuf, andmask);

    // this is the extra step: compare each byte == 0 to produce 0 or -1
    return _mm256_cmpeq_epi8(isolated_inverted, _mm256_setzero_si256());
     // alternative: use a plain `and` above and compare against the AND mask
     // itself to get 0 or -1, avoiding the need for a vector zero constant.
}

See it on the Godbolt Compiler Explorer.
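
The alternative hinted at in the comment at the end of that function would look roughly like this (a sketch of mine, not part of the original answer): isolate with a plain and, then compare against the same AND mask, so no all-zeros constant is needed.

__m256i expand_bits_to_bytes_alt(uint32_t x)
{
    __m256i xbcast   = _mm256_set1_epi32(x);
    __m256i shufmask = _mm256_set_epi64x(
        0x0303030303030303, 0x0202020202020202,
        0x0101010101010101, 0x0000000000000000);
    __m256i shuf     = _mm256_shuffle_epi8(xbcast, shufmask);

    __m256i andmask  = _mm256_set1_epi64x(0x8040201008040201);
    __m256i isolated = _mm256_and_si256(shuf, andmask);

    // a byte equals its mask byte exactly when the corresponding source bit was set
    return _mm256_cmpeq_epi8(isolated, andmask);
}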

Also see "is there an inverse instruction to the movemask instruction in intel avx2?" for other element sizes.

Peter Cordes
Z boson
  • Could you give some idea on how you'd do it with AVX2? Thanks! – alecco Jun 16 '14 at 19:55
  • 1
    @alecco, I updated my answer showing how to do this with AVX2. I tested it as well. – Z boson Jun 17 '14 at 09:25
  • 1
    You're a star! Thanks a lot. Wish I could give you more upvotes. – alecco Jun 17 '14 at 14:00
  • 1
    If you want the resulting bytes to be 0 or -1 (so each bit of the mask expands to all bits of the vector byte), you need one more step. After the shuffle, use `andn` instead of `and` (inverting `y`). Then use a `_mm256_cmpeq_epi8` against a vector of all-zeros to invert again. – Peter Cordes May 26 '16 at 22:44
  • 2
    Also, I'd write mask2a as `_mm256_set1_epi64x(0x80'40'20'10'08'04'02'01)`. (The C++14 `'` separators for readability are totally optional.) To make it easy to select 128 vs. 256, you can use a `_mm_set1_epi64x()` and then the AVX2 version can use `_mm256_set_m128i(same,same)`. It all optimizes away at compile time. – Peter Cordes May 26 '16 at 22:58
  • For mask1, I'd highly recommend using `_mm_set` rather than a load. You absolutely don't want your function to compile into scalar immediate stores to the stack, and then a vector load! Using `_mm_set` for constants allows sharing of the constant between multiple uses, exactly like string literal merging (probably even using the same compiler logic). I don't see a nice way to write it more compactly, though, except with a CPP macro to repeat its argument 8 times. – Peter Cordes May 26 '16 at 23:00
  • @PeterCordes, that's why I said " If you plan to do this several times you should probably define the masks outside of the main loop." – Z boson May 27 '16 at 09:08
  • @PeterCordes, BTW, feel free to edit my answer with your suggestion but I prefer that you append your solution and not change what I have written. Just a horizontal marker and write after that if you like. – Z boson May 27 '16 at 09:32
  • @Zboson: I wasn't worried about it being hoisted or not, I was saying that it will waste instructions copying the data onto the stack, instead of just loading from a static constant. Look at the asm from your version converted to a function taking an `int` arg, but still printing: https://godbolt.org/g/HfYeMd. Note the `vmovdqa YMMWORD PTR [rbp-144], ymm0`. It's storing the constant it just loaded from `.LC0`. You can get rid of this with `static const char []`, but that still defeats constant-pool merging if this function was inlined into multiple files. **Just use `_mm_set`** – Peter Cordes May 27 '16 at 15:36
  • 1
    There's [a duplicate of this that has the same strategy](https://stackoverflow.com/questions/21622212/how-to-perform-the-inverse-of-mm256-movemask-epi8-vpmovmskb). (Also suggesting an OR with a mask that has one bit *unset* and `vpcmpeqb` against `set1(0xFF)`, but an all-zeros vector is slightly cheaper than all-ones.) Not sure if I should close it. I was looking for a non-AVX version to link. I guess this has a non-AVX2 version at least. – Peter Cordes Nov 19 '17 at 06:30
  • Do you HAVE to do it with assembly/intrinsics, or can you do it portably? – MarcusJ Jan 03 '18 at 11:12
  • @PeterCordes: I think in `get 0 or -1` you can get away without `_mm256_setzero_si256` - by comparing with the `and` mask. https://gcc.godbolt.org/z/LBAz_a – Denis Yaroshevskiy May 08 '20 at 16:38
  • 1
    That's correct, that's a missed optimization here. According to https://stackoverflow.com/posts/36491672/revisions (my answer on a duplicate of this), I didn't think of that until revision #6 in 2017, long after my last edit to this question. – Peter Cordes May 08 '20 at 16:48