Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

Question

(Editor's note: this question was originally: How should one access the m128i_i8 member, or members in general, of the __m128i object?, trying to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem and the accepted answer is about the XY problem here. Another answer does answer this question.)

I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking.

I continue getting the error "request for member ‘m128i_i8’ in ‘(my var name)', which is of non-class type ‘wirelabel {aka __vector(2) long long int}’" which I don't understand because I've included all the correct headers and it does recognize __m128i variables.

Note1: wirelabel is a typedef for __m128i i.e. there exists in a header

typedef __m128i wirelabel

Note2: The reason Note1 was used is explained in the following other question: tbb::cache_aligned_allocator: Getting "request for member...which is of non-class type" with __m128i. User error or bug?

Note3: I'm using the compiler g++

Note4: This following question doesn't answer mine but does discuss related information Why should you not access the __m128i fields directly?

I also know that there is a _mm_set_epi8 function but it requires you set all 8 bit sections at once and that is not an option for me currently.

The question the accepted answer answers:

Edit: I was asked for more specifics as to why I think I need to access each of the 16 8-bit parts of the __m128i object, and here is why: I have a bool array with size 'n*128' (n is a size_t) and I need to store these within an array of 'wirelabel' with size 'n'.

Now because wirelabel is just an alias/typedef (correct me if there is a difference) for __m128i, each of the 'n' indices of 128 bools can be stored in the 'wirelabel' array.

However, in order to do this I believe need to convert every 8-bits into its signed equivalent and store it in the correct 8bit index in each 'wirelabel' pointer in the array.

Possible duplicate of [tbb::cache\_aligned\_allocator: Getting "request for member...which is of non-class type" with \_\_m128i. User error or bug?](https://stackoverflow.com/questions/8529372/tbbcache-aligned-allocator-getting-request-for-member-which-is-of-non-clas) — Justin, Mar 13 '18 at 18:42
I'm assuming you are not using Visual C++, as it doesn't have a `__vector` keyword AFAICT — Justin, Mar 13 '18 at 18:43
The reason for Note1 is explained in Note2 no? would you like a clarification? if so just let me know. — z.karl, Mar 13 '18 at 21:25
@LightnessRacesinOrbit I'm adding an edit to answer your question. — z.karl, Mar 13 '18 at 22:16
`'bool' array with size 'n*128'` - what is underlying data type for this `n` sized array? — Severin Pappadeux, Mar 13 '18 at 22:59
@SeverinPappadeux I'm sorry, but I don't seem to understand your confusion. The array looks like bool *arrayname = malloc(n*128*sizeof(bool)) — z.karl, Mar 13 '18 at 23:02
This might also be of interest: To set or unset individual bits within an `__m128` , you can use the technique described in this [answer](https://stackoverflow.com/a/39595704). That answer is for 256 bit vectors, but the translation of the code to 128 bit vectors is straightforward. Unfortunately it requires AVX2 even for 128 bit vectors because of the variable shift `vpsllvd` instruction. Furthermore, it seems to me that `_mm_extract_epi8()` and `_mm_insert_epi8()` are probably relevant too, unless your data is contiguous in memory. — wim, Mar 14 '18 at 11:03

Peter Cordes · Accepted Answer · 2023-06-12T00:24:09.463

So your source data is contiguous? You should use _mm_load_si128 instead of messing around with scalar components of vector types.

Your real problem is packing an array of bool (1 byte per element in the ABI used by g++ on x86) into a bitmap. You should do this with SIMD, not with scalar code to set 1 bit or byte at a time.

pmovmskb (_mm_movemask_epi8) is fantastic for extracting one bit per byte of input. You just need to arrange to get the bit you want into the high bit.

The obvious choice would be a shift, but vector shift instructions compete for the same execution port as pmovmskb on Haswell (port 0). (http://agner.org/optimize/). Instead, adding 0x7F will produce 0x80 (high bit set) for an input of 1, but 0x7F (high bit clear) for an input of 0. (And a bool in the x86-64 System V ABI must be stored in memory as an integer 0 or 1, not simply 0 vs. any non-zero value).

Why not pcmpeqb against _mm_set1_epi8(1)? Skylake runs pcmpeqb on ports 0/1, but paddb on all 3 vector ALU ports (0/1/5). It's very common to use pmovmskb on the result of pcmpeqb/w/d/q, though.

#include <immintrin.h>
#include <stdint.h>

// n is the number of uint16_t dst elements
// We access n*16 bool elements from src.
void pack_bools(uint16_t *dst, const bool *src, size_t n)
{
     // you can later access dst with __m128i loads/stores

    __m128i carry_to_highbit = _mm_set1_epi8(0x7F);
    for (size_t i = 0 ; i < n ; i+=1) {
        __m128i boolvec = _mm_loadu_si128( (__m128i*)&src[i*16] );
        __m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
        dst[i] = _mm_movemask_epi8(highbits);
    }
}

Because we want to use scalar stores when writing this bitmap, we want dst to be in uint16_t for strict-aliasing reasons. With AVX2, you'd want uint32_t. (Or if you did combine = tmp1 << 16 | tmp to combine two pmovmskb results. But probably don't do that.)

To deal with strict-aliasing issues if you want to access your mask bitmap with a different C type later, you could use memcpy for these stores, as shown in another Q&A.

This compiles into an asm loop like this (with gcc7.3 -O3, on the Godbolt compiler explorer)

.L3:
    movdqu  xmm0, XMMWORD PTR [rsi]
    add     rsi, 16
    add     rdi, 2
    paddb   xmm0, xmm1
    pmovmskb        eax, xmm0
    mov     WORD PTR [rdi-2], ax
    cmp     rdx, rsi
    jne     .L3

So it's not wonderful (7 fuse-domain uops -> front-end bottleneck at 16 bools per ~1.75 clock cycles). Clang unrolls by 2, and should manage 16 bools per 1.5 cycles.

Using a shift (pslld xmm0, 7) would only run at one iteration per 2 cycles on Haswell, bottlenecked on port 0. That's not a problem on Skylake and later; shifts can run on more ports so _mm_slli_epi32(v, 7) is good there, and avoids needing a vector constant. See also Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

Davislor · Answer 2 · 2018-03-13T23:21:26.640

0

Create an anonymous union containing a _m128i member and an array of the other type whose members you want to set. Type-punning is legal in C, and supported as an extension in g++, clang++ and MSVC. If you want to set individual bits, you can declare the other member as a struct of bitfields. The order of a bitfield is implementation-defined, but you’re using an Intel intrinsic anyway, so it’ll be little-endian.

edited Mar 13 '18 at 23:21

answered Mar 13 '18 at 23:16

Davislor

14,674
2
34
49

That was a super cool idea! I hadn't heard of anonymous unions in C before, or Type-punning for that matter. However, after some research into these subjects, this solution cannot work for me as though these variables are used dependently in this instance, later on they are used independently and having them in a union will cause one to overwrite old changes in the other that I need to keep. – z.karl Mar 14 '18 at 06:03
For *setting* a vector, there's no advantage to a union over `_mm_set_epi8(highest, ..., lowest);` (or `_mm_setr_epi8(lowest, ..., highest)`.) It might compile differently, but if it compiles to less efficient code than a union, that's a missed-optimization bug in the compiler. (It does happen so worth checking out; `_mm_set_epi8` is a LOT of elements so it sucks a lot.) But in @z.karl's case, it sounds like **the data is contiguous**, so simply `_mm_load_si128( (__m128i*)&bool_array[i])` – Peter Cordes Mar 14 '18 at 06:17
1

@PeterCordes The advantage of a `union` would be that you could set a single element by itself. I suspect that the type-punning code would copy to memory and back, while `_mm_set_epi8()` is meant to set the entire vector with a SSE instruction. But I would have to compile with `-S` and check the generated code. – Davislor Mar 14 '18 at 15:36
1

@z.karl Sorry it doesn’t suit your needs. It’s a good tool to have in your box. – Davislor Mar 14 '18 at 15:37
I wasn't considering the case of modifying a single member. Setting a single member might compile to `pinsrb` if you're lucky. But without SSE4, yeah, movdqa store / byte store / movdqa load is likely. For wider elements, you're more likely to get a `movd` + shuffle or something to merge data into a vector register. – Peter Cordes Mar 14 '18 at 22:22

Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

The question the accepted answer answers:

2 Answers2

Linked

Related