4

I want to convert an 8-bit integer to an array of size 8, with each element containing the corresponding bit value of the integer.

For example: I have int8_t x = 8; and I want to convert this to int8_t array_x[8] = {0,0,0,0,1,0,0,0};

This has to be done efficiently, since this calculation is part of a signal-processing block. Is there an efficient way to do this? I did check the blend instruction; it didn't suit my requirement with 8-bit array elements. The development platform is AMD Ryzen.

yadhu
  • Since the array is only 64 bits long, I would use the BMI2 `pdep` instruction, or `_pdep_u64` – wim Aug 30 '18 at 14:16
  • See also [How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?](https://stackoverflow.com/q/21622212/2439725). – wim Aug 30 '18 at 14:37
  • @harold: not a duplicate after all; @aqrit's nice answer that gets 0/1 with PAND / PMINUB doesn't apply there where you want 0/-1, and is a non-obvious optimization vs. using that and masking. – Peter Cordes Aug 31 '18 at 00:18

2 Answers

3

"Inverse movemask" for a single byte with 0x00:0x01 formatted results, with SIMD but without BMI2.

__m128i v = _mm_set1_epi8(bitmap);                                  /* broadcast the input byte to every lane             */
v = _mm_and_si128(v, _mm_set_epi32(0, 0, 0x80402010, 0x08040201)); /* isolate a different bit in each of the low 8 lanes */
v = _mm_min_epu8(v, _mm_set1_epi8(1));                             /* clamp non-zero lanes to 1                          */
_mm_storel_epi64((__m128i*)&array_x[0], v);                        /* store the low 8 bytes into array_x                 */
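
A minimal self-contained test harness for the snippet above (a sketch: the wrapper function name, the compile line, the example input, and the printing loop are mine, not part of the answer):

/*  gcc -O3 -Wall -m64 inv_movemask_byte.c  */

#include<emmintrin.h>   /* SSE2 intrinsics */
#include<stdint.h>
#include<stdio.h>

/* Expand one byte into 8 bytes of 0x00/0x01, with array_x[i] = bit i of bitmap. */
static void byte_to_bit_array(uint8_t bitmap, int8_t array_x[8]){
    __m128i v = _mm_set1_epi8((char)bitmap);
    v = _mm_and_si128(v, _mm_set_epi32(0, 0, 0x80402010, 0x08040201));
    v = _mm_min_epu8(v, _mm_set1_epi8(1));
    _mm_storel_epi64((__m128i*)&array_x[0], v);
}

int main(){
    int i;
    int8_t array_x[8];
    byte_to_bit_array(0b01001100, array_x);
    for (i = 0; i < 8; i++){
        printf("array_x[%i] = %i\n", i, array_x[i]);
    }
}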
aqrit
  • If you don't have BMI2 it's better to [use a multiplication technique](https://stackoverflow.com/a/51750902/995714) instead of a lot of SIMD instructions – phuclv Aug 31 '18 at 00:40
2

The first example at the end of this answer shows how to use the BMI2 pdep instruction to compute the 8-byte array.

Note that on Intel Haswell processors and newer, the pdep instruction has a throughput of one instruction per cycle and a latency of 3 cycles, which is fast. On AMD Ryzen this instruction is unfortunately relatively slow: both the latency and the reciprocal throughput are 18 cycles. On AMD Ryzen it is better to replace the pdep instruction with a multiplication and a few bitwise operations, which are quite fast on Ryzen; see the second example at the end of this answer.


See also here and here for efficient inverse movemask computations with a scalar source and a 256-bit AVX2 vector destination.

Instead of working with 8 bits and 8 bytes at a time, it might be more efficient to reorganize your algorithm to work with 4 x 8 bits and 4 x 8 bytes per step. In that case the full AVX2 vector width of 256 bits can be utilized, which might be faster; a sketch of that idea follows below.
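
As a sketch of that idea (assuming AVX2 and the same 0x00/0x01 output format; the function name and this particular implementation are mine, not a tested drop-in): broadcast the 32-bit mask, shuffle each of its four bytes into its own group of 8 byte lanes, isolate one bit per lane, and clamp non-zero lanes to 1.

#include<immintrin.h>   /* AVX2 intrinsics */
#include<stdint.h>

/* Expand a 32-bit mask into 32 bytes of 0x00/0x01, with byte i = bit i of bits. */
static inline __m256i inverse_movemask_epi8(uint32_t bits){
    __m256i v = _mm256_set1_epi32((int)bits);
    /* Send byte 0 of the mask to lanes 0..7, byte 1 to lanes 8..15, etc.        */
    __m256i bytes = _mm256_shuffle_epi8(v,
        _mm256_set_epi64x(0x0303030303030303, 0x0202020202020202,
                          0x0101010101010101, 0x0000000000000000));
    /* Isolate a different bit in each lane of every 8-lane group,               */
    /* then clamp the non-zero lanes to 1.                                       */
    __m256i sel = _mm256_and_si256(bytes, _mm256_set1_epi64x((long long)0x8040201008040201ULL));
    return _mm256_min_epu8(sel, _mm256_set1_epi8(1));
}

Store the result with _mm256_storeu_si256 into a 32-byte destination; this handles four input bytes per step.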

Peter Cordes shows that the pext instruction can be used for the conversion in the opposite direction: from 8 bytes to 8 bits.
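
A sketch of that reverse direction with pext (assuming the same byte-per-bit 0x00/0x01 layout as above; the function name is mine):

#include<stdint.h>
#include<string.h>
#include<x86intrin.h>

/* Pack 8 bytes of 0x00/0x01 (byte i = bit i) back into a single 8-bit mask. */
static inline uint8_t bit_array_to_byte(const uint8_t a8[8]){
    uint64_t bytes8;
    memcpy(&bytes8, a8, 8);                                  /* load the 8 bytes as one uint64_t     */
    return (uint8_t)_pext_u64(bytes8, 0x0101010101010101);   /* gather the lowest bit of each byte   */
}

The same Ryzen caveat applies to pext as to pdep, so on Ryzen a multiply/shift alternative may again be preferable.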


Code example with the pdep instruction:

/*  gcc -O3 -Wall -m64 -march=skylake bytetoarr.c  */

#include<stdint.h>
#include<stdio.h>
#include<x86intrin.h>

int main(){

    int i;
    union {
        uint8_t  a8[8];
        uint64_t a64;
    } t;
    /*  With mask = 0b0000000100......0100000001 = 0x0101010101010101    */
    /*  the input bits 0, 1, ..., 7 are expanded                         */
    /*  to the right positions of the uint64_t = 8 x uint8_t output      */
    uint64_t mask = 0x0101010101010101;

    /* example input: */
    uint8_t x = 0b01001100;

    t.a64 = _pdep_u64(x,mask);

    for (i = 0; i < 8; i++){
        printf("a[%i] = %hhu\n", i, t.a8[i]);
    }
}

The output is:

$ ./a.out
a[0] = 0
a[1] = 0
a[2] = 1
a[3] = 1
a[4] = 0
a[5] = 0
a[6] = 1
a[7] = 0

Code example for AMD Ryzen processors:

/*  gcc -O3 -Wall -m64 -march=skylake bytetoarr_amd.c  */

#include<stdint.h>
#include<stdio.h>
#include<x86intrin.h>

int main(){

    int i;
    union {
        uint8_t  a8[8];
        uint64_t a64;
    } t;

    /* example input: */
    uint8_t  x    = 0b01001100;

    uint64_t x64  = x;                 
    uint64_t x_hi = x64 & 0xFE;                                                  /* Unset the lowest bit.                        */
    uint64_t r_hi = x_hi * 0b10000001000000100000010000001000000100000010000000; /* Copy the remaining 7 bits 7 times: for i = 1..7, one copy of bit i lands at bit 8*i. */
    uint64_t r    = r_hi | x64;                                                  /* Merge the lowest bit into the result.        */
    t.a64         = r & 0x0101010101010101;                                      /* Mask off the bits at the unwanted positions. */

    for (i = 0; i < 8; i++){
        printf("a[%i] = %hhu\n", i, t.a8[i]);
    }
}
wim
  • For future readers, note that `pdep` is not efficient on Ryzen (or Excavator which also has it, I think). Also, the mask is `0x0101010101010101` if you don't want a huge binary constant in your source. – Peter Cordes Aug 30 '18 at 14:52
  • I did check the pdep instruction. As @PeterCordes mentioned, it is very expensive: on Ryzen it has 18 cycles of latency according to Agner's manual. – yadhu Aug 30 '18 at 16:29
  • @yadhu: But on Intel it's very efficient, especially if you want the result in an integer instead of in a SIMD vector. – Peter Cordes Aug 30 '18 at 16:31
  • @PeterCordes Sometimes code becomes more self-explanatory when you write constants in binary. That is not the case here, I admit :-) . – wim Aug 30 '18 at 17:43