6

Is it possible to use SSE instructions on the underlying data of a std::bitset? The bitsets I am working with are larger than unsigned long, so the to_ulong() method will not suffice. For instance, can I use an instruction like this:

__m128i* ptr= (__m128i*)(&my_bitset[0]);

Then perform SSE operations as normal?

I have tried to search the internet quite a bit for people using std::bitset with SSE and it doesn't appear to be a common use case.

kip622
  • Did you check that your compiler (with the appropriate flags) doesn't already generate SSE instructions for bitsets? – celtschk Jun 07 '13 at 19:04
  • It does for some operations. But I am repeatedly setting bits with a test that is essentially `my_bitset[i] = a > b`, which could easily be done with SSE ops on __m128i (it doesn't look like the compiler generates SSE ops for this case) – kip622 Jun 07 '13 at 19:18
  • my_bitset[0] is not a reference to the array but a proxy for a bit, you can't do this cast. Could you be more precise about the operation you want to vectorize? – Marc Glisse Jun 27 '13 at 01:07
  • @MarcGlisse: SSE `PMOVMSK` comes to mind, to vectorize `for (i = 0; i < numbits; i++) bitset[i] = (a[i] < b[i]);` - as you can bunch this up into `PCMPGT`/`PMOVMSK`. – FrankH. Aug 02 '13 at 08:49
  • `my_bitset[0]` will return a proxy-object, so taking its address will not help. You may have luck directly casting `&my_bitset` to `int*` or `__m128i*`. But I assume the internals of a bitset are implementation defined. – chtz Jun 10 '17 at 16:43

4 Answers

6

Is it possible to use SSE instructions on the underlying data of a std::bitset?

In

__m128i* ptr= (__m128i*)(&my_bitset[0]);

my_bitset[0] returns a temporary proxy object of unspecified layout, which contains a pointer to the container/storage and the bit index (e.g. the GNU C++ std::bitset::reference implementation). Casting the address of this temporary proxy object to __m128i* would be meaningless anyway, but C++ doesn't allow taking the address of a temporary at all, so &my_bitset[0] results in a compile error.
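
For illustration, a minimal example of the problem (the exact diagnostic is compiler-specific):

#include <bitset>

int main()
{
    std::bitset<256> my_bitset;
    auto proxy = my_bitset[0];  // std::bitset<256>::reference, a proxy object returned by value
    proxy = true;               // writes through the proxy to bit 0
    // &my_bitset[0];           // error: cannot take the address of the temporary proxy
}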


std::bitset may use SIMD instructions for its member functions automatically if/when the compiler chooses to vectorize it.

In this example, gcc decided to use AVX instructions with 256-bit registers, whereas clang decided not to vectorize at all. Neither choice is ideal:

  • gcc generated AVX instructions with 256-bit ymm registers, which reduce CPU frequency on older Intel CPUs (or crash overclocked ones with a forced AVX offset of 0). But the vector size here is too small to justify the increased CPU power consumption and possibly lower frequency that sporadic ymm instructions here and there incur.

  • clang generated instructions on 64-bit general-purpose registers, which take more instruction bytes and more loads/stores than SSE with 128-bit xmm registers would. CPUs can only perform a fixed number of load/store instructions (not bytes) per cycle, so it makes sense to maximize the amount of data loaded and stored per instruction.

The ideal choice in this example may be SSE instructions with 128-bit xmm registers: they minimize the number of load/store instructions without downclocking the CPU. Which goes to show that compiler auto-vectorization is often not ideal.


std::bitset, unfortunately, doesn't provide direct access to its storage, and any access to it through a C-style cast may result in undefined behavior, without a warning or error, due to layout, alignment, or strict-aliasing violations.

std::bitset is unlikely to use any non-standard/SIMD type for its storage because of portability constraints, so casting its storage to a wider SIMD type pretty much guarantees alignment and strict-aliasing violations. There are non-portable ways to work around that, but they are brittle against future changes, which is why I cannot recommend going this way.


You may like to look for other containers designed with SIMD in mind, such as Vc: portable, zero-overhead C++ types for explicitly data-parallel programming. It allows choosing the SIMD instruction type on a per-container-class basis, e.g. you may like to use only 128-bit xmm register instructions for this particular container type, even if 256-bit ymm registers are available.


gcc and clang both support Using Vector Instructions through Built-in Functions on types declared with __attribute__((vector_size (N))), which is another option:

Currently, GCC allows using the following operators on these types: +, -, *, /, unary minus, ^, |, &, ~, %.

But these don't allow choosing the underlying SIMD type/instructions on a per-container-class basis, only per translation unit with compiler options like -mno-avx.
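
For example, a minimal sketch of the vector extension (the type and function names here are illustrative); the compiler picks the instructions, and with -mno-avx it is limited to SSE:

typedef int v4si __attribute__((vector_size (16)));  // four 32-bit ints in one 128-bit vector

v4si f(v4si a, v4si b)
{
    return (a + b) & ~b;  // element-wise add, NOT and AND; e.g. paddd/pandn with SSE2
}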

Maxim Egorushkin
  • When AVX2 was new (Haswell), *server* chips tended to have some frequency penalty for using 256-bit vectors. These days it's pretty minor, with only 512-bit vectors paying a significant frequency penalty on Skylake-X, and on IceLake not much penalty even for AVX-512, at least on client chips. See discussion on https://reviews.llvm.org/D111029 about clang enabling 512-bit vectorization at least for `-march=icelake-client`, inspired by a table of Ice-Lake client frequencies on https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html. – Peter Cordes Nov 10 '21 at 00:40
  • (*Any* frequency transition at all costs significant latency when it happens, so you wouldn't want it just for something that happens once every 200 ms or so, but glibc functions like memcpy, memcmp, strlen, and so on use 256-bit vectors when available so most programs will have some consistent usage of AVX2.) – Peter Cordes Nov 10 '21 at 00:41
  • *casting its storage to a wider SIMD type pretty much guarantees alignment and strict aliasing violation* - Of course you'd use `_mm256_loadu_si256`, not `load`, unless you did `alignas(32) std::bitset<256>`. Intrinsic load/store, and even raw dereference of a `__m256i*` pointer, are safe against strict-aliasing, though, as guaranteed by Intel's intrinsics API. GCC/clang implement that by declaring `__m256i` as a `may_alias` type: [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) – Peter Cordes Nov 10 '21 at 00:46
  • Re: AVX / AVX-512 impact on CPU frequency see [SIMD instructions lowering CPU frequency](https://stackoverflow.com/a/56861355). **Using "light" 256-bit instructions will never cause down-clocking on Skylake, and that includes all AVX2 integer YMM instructions except multiplies.** – Peter Cordes Nov 10 '21 at 00:49
  • Discussion in comments on that linked Q&A indicates that probably even Haswell only clocks down for FP YMM instructions (or integer multiplies which also use those execution units). So I think the clock-speed concern is probably a non-issue. – Peter Cordes Nov 10 '21 at 01:46
1

bitset does not have a standard way to access its internal data.

There's the itsy_bitsy library that provides a bitset-like interface over other data. bit_view is what you need: it wraps data with the ability to manipulate bits, but without insert/erase operations.

I'm not sure if you can have bitsy::bit_view directly on an __m128i, but it supports things like bitsy::bit_view<std::span<char>>, so you can have __m128i variable(s) and reinterpret them as a span of chars, as in the sketch below.
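
A rough sketch of that idea; I haven't verified itsy_bitsy's exact API, so the header path and the bit_view construction from a span are assumptions based on its documentation:

#include <immintrin.h>
#include <span>
#include <itsy/bitsy.hpp>  // assumption: the library's umbrella header

int main()
{
    __m128i v = _mm_set1_epi32(0x0F0F0F0F);

    // view the 16 bytes of the vector as a span of chars
    std::span<char> bytes(reinterpret_cast<char*>(&v), sizeof(v));

    // assumption: bit_view is constructible from its underlying range type
    bitsy::bit_view<std::span<char>> bits(bytes);

    bool b = bits[3];  // read bit 3 of the vector's low byte
    return b ? 0 : 1;
}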

Alex Guteniev
1

You can just use SIMD on the whole bitset object, if you know the object layout of your standard library.

Most implementations of std::bitset<> make the obvious implementation choice that the object-representation of the whole bitset object is just the bits, packed into contiguous bytes. (I'd be surprised if any mainstream real-world implementation wasn't like that, but there's no guarantee you can safely assume that.) Most of those choose to use an array of some integer type wider than a byte.

If we're talking about just the x86 compilers that implement Intel's intrinsics API, that's a smaller set of implementations.

In GNU libstdc++ at least, the lowest-address chunk holds bits 0..63, and so on. (So it's little-endian across chunks, and x86 is little-endian for the bytes within chunks.) And bitset[0] is in the low byte of the low word, i.e. reading it compiles to a load and an `and eax, 1`. It's possible that implementations might make different choices, like storing bitset[0] at the bottom of the highest-address chunk, big-endian style. That wouldn't line up with how x86 bt / bts bit-string instructions index memory, but those are slow anyway, so the main reason for not doing it is that it would be more work to turn a runtime-variable index into an address and a bitmask or shift count.

If you want to try to non-portably take advantage of this, use _mm_loadu_si128 on the std::bitset object itself, not on the proxy object that bitset[0] returns.

#include <bitset>
#include <immintrin.h>

// being a struct or class member isn't necessary, just a handy place to put an alignas()
// for example purposes.
struct foo {
 alignas(32) std::bitset<384> bs;  // 32-byte aligned, 48 bytes total.
           // alignas(16) would be sufficient for what I'm doing with SSE2
 int x, y;                 // with or without these, the struct size is a multiple of the alignment, thus 64B.
};
  // beware that allocating this with  new  might not respect alignment before C++17


void bar(foo *pfoo)
{
    char *bsp = (char*) &(pfoo->bs);   // pointer to (the first byte of) the bitset
      // as a char* so pointer math works in bytes.
      // unfortunately load/store intrinsics require casting back to __m128i*
      // until AVX-512 when Intel realized void* would be better.

    __m128i v0 = _mm_load_si128( (__m128i*)bsp );        // aligned load of bits 0..127
    __m128i v1 = _mm_loadu_si128( (__m128i*)(bsp+3) );   // unaligned load of bits 24..151
    v0 = _mm_and_si128(v0, v1);
    _mm_store_si128( (__m128i*)(bsp+16), v0 );           // aligned store at the next 16-byte boundary (bits 128..255)
}

This compiles (with GCC11.2 on Godbolt) to the following asm:

bar(foo*):
        movdqu  xmm0, XMMWORD PTR [rdi+3]    # unaligned load has to use movdqu
        pand    xmm0, XMMWORD PTR [rdi]      # aligned load can fold into a memory operand even without AVX
        movaps  XMMWORD PTR [rdi+16], xmm0   # aligned store.  (movaps is shorter than movdqa)
        ret

With AVX, the compiler could have chosen to do a vmovdqa load for v0 and use an unaligned memory source operand for vpand xmm0, xmm0, [rdi+3], but I compiled without -march=haswell to demo the SSE advantage of being able to use aligned load intrinsics. (See also Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? re: tuning options in older GCC.)

You can even alignas(32) std::bitset<256> bs to align that instance of the bitset by 32 bytes, allowing use of aligned load/store like _mm256_load_si256 instead of loadu, as in the sketch below. There could still be other objects in part of the last 32 bytes if your bitset isn't a multiple of 256 bits, so don't assume it's just alignment padding you can step on. It wouldn't be thread-safe to do a non-atomic load/store of those bytes (e.g. if you're modifying the bits that are part of the bitset and storing back the later bytes unchanged).
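
A minimal sketch of that 256-bit variant (needs AVX2 enabled, e.g. -mavx2, and assumes sizeof(std::bitset<256>) == 32 as on mainstream implementations):

#include <bitset>
#include <immintrin.h>

alignas(32) std::bitset<256> bs;   // exactly one 32-byte chunk, 32-byte aligned

void toggle_all()
{
    __m256i v = _mm256_load_si256( (const __m256i*)&bs );  // aligned 256-bit load
    v = _mm256_xor_si256(v, _mm256_set1_epi8(-1));         // flip every bit
    _mm256_store_si256( (__m256i*)&bs, v );
}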

Beware that allocating objects with more alignment than alignof(max_align_t) (typically 16 in x86-64 implementations) is only well-supported with new in C++17. Before that, alignas() only Just Worked for static and automatic storage.

Reminder: nothing guarantees this is portable

But it will probably work, on a C++ implementation that isn't a DeathStation 9000.

If you can't / don't want to hand-roll your own bitmap, or don't want to use Alex's suggestion of itsy_bitsy which has a documented way to get at the data, then this hack might be worth it if you can't get your compiler to make efficient asm in a more portable way.

As long as your C++ library implements bitset with something like class bitset { private: _chunk_t _data[size]; };, there's nothing undefined about messing with the object-representation via intrinsics. (GNU libstdc++ uses _WordT _M_w[_Nw];)

Intrinsics are defined to safely alias any other data, just like char*. GCC/clang implement this by defining them as may_alias types. See Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
(This does bypass the normal public / private restrictions, though.)

If this somehow breaks with some future compiler version, that's your problem. I think it's unlikely that something would change normal std::bitset implementations to not have their object representation just be the array of bits, though.

You can look at the asm for something like return pfoo->bs.to_ulong() and see what it loads to check for set high bits (unfortunately not vectorizing the test), before loading the low chunk. That confirms the bits are where we expected. (See the Godbolt link).

If you do this, write a unit test that uses _mm_set_epi32(1,0,0,0) or something and store that to the bitset, then make sure the one set bit is where you expect it to be, at bs[96]. That way you'll detect if the implementation changes the layout of std::bitset<>.
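
For example, a minimal sketch of such a unit test (assuming little-endian 64-bit chunks, as in libstdc++):

#include <bitset>
#include <cassert>
#include <immintrin.h>

void check_bitset_layout()
{
    alignas(16) std::bitset<128> bs;
    __m128i v = _mm_set_epi32(1, 0, 0, 0);   // args are high to low: only bit 96 of the vector is set
    _mm_store_si128( (__m128i*)&bs, v );
    assert(bs.count() == 1 && bs[96]);       // fails if the implementation's layout differs
}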

You could also use a static_assert on the size. For a size like 256 bits, sizeof() will be a constant 32 even across implementations that use char bits[32] or uint64_t bigchunks[4]. sizeof(std::bitset<129>) could vary, though. But static_assert won't catch differences in the order of the words or bits within a word.

If you can use C++20, then the unit test for bit order can also be put in a static_assert, as the needed bitset operations are constexpr and std::bit_cast can be used at compile time. Though in this case the unit test wouldn't be able to use SSE intrinsics, and would have to use plain C++ operations. You could use char* operations to manipulate the object-representation of a std::bitset the same way you would with intrinsics, though. Or even better, use std::bit_cast<>, which shouldn't compile for types with a vtable or pointer members, at least in a constexpr context. For example, Alex suggested https://godbolt.org/z/1advToGf5 in comments.

The very fact that std::bitset operations are becoming constexpr probably rules out some insane implementation choices entirely.
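
A minimal compile-time sketch along those lines (C++20; assumes little-endian 64-bit chunks, and since only the constexpr unsigned long long constructor can set bits before C++23, it only exercises the low chunk):

#include <array>
#include <bit>
#include <bitset>
#include <cstdint>

static_assert(sizeof(std::bitset<256>) == 32);   // no extra members or vtable

constexpr std::bitset<128> one_bit{1ull << 33};  // constexpr ctor from unsigned long long
constexpr auto words = std::bit_cast<std::array<std::uint64_t, 2>>(one_bit);
static_assert(words[0] == (1ull << 33) && words[1] == 0,
              "layout is not little-endian 64-bit chunks");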

Peter Cordes
  • If going with the assumption of `bitset` layout, suggest adding `static_assert` for the size. The same size as specified as `bitset` template parameter would indicate that there's no vtable pointer or extra data members. Failure of this assert would mean that this `bitset` is odd, though not necessarily mean that casts would not work, as extra members may follow actual storage, in this case, the cast still works. – Alex Guteniev Nov 10 '21 at 06:56
  • @AlexGuteniev: Thanks, good suggestion as a way to rule out extra members. It doesn't rule out changes in translating the bit-index to a byte offset and so on, though, which `_mm_storeu_si128` would. So it's good to use both. – Peter Cordes Nov 10 '21 at 07:09
  • @AlexGuteniev: Interesting idea to use static_assert with constexpr tests. You *could* use `char*` instead of `__m128*` / `_mm_storeu_si128` because both are strict-aliasing safe. It's well defined what intrinsics will do to bits/bytes, so the only question is about the exact layout and indexing of the `std::bitset` object. – Peter Cordes Nov 10 '21 at 07:40
  • `bit_cast` creates _copies_, so it is not even subject to strict-aliasing rule. In [cppreference example](https://en.cppreference.com/w/cpp/numeric/bit_cast#Example) it is shown to cast directly from `double` to `std::uint64_t`, and if `double` is 8 bytes, this code is valid. Ability to `bit_cast` also rules out pointers/vtable ptrs in the implementation, `bit_cast` cannot cast pointers or types containing them. – Alex Guteniev Nov 10 '21 at 08:08
  • It works: https://godbolt.org/z/1advToGf5 – Alex Guteniev Nov 10 '21 at 08:13
0

It's not a problem to do, really. You need boost::dynamic_bitset<> and the technique from this article:

https://www.generacodice.com/en/articolo/882721/extract-subset-from-boost-dynamic-bitset

See the last part. What you want is to grab dynamic_bitset::m_bits.
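
m_bits is private, though. A less intrusive sketch using boost's documented block API (note it copies the blocks out and back rather than operating on the bitset's storage in place, and assumes the bitset has at least 128 bits):

#include <boost/dynamic_bitset.hpp>
#include <immintrin.h>
#include <cstdint>
#include <vector>

void and_low_128(boost::dynamic_bitset<std::uint64_t>& bits, __m128i mask)
{
    std::vector<std::uint64_t> blocks(bits.num_blocks());
    boost::to_block_range(bits, blocks.begin());            // copy the blocks out

    __m128i v = _mm_loadu_si128( (const __m128i*)blocks.data() );
    v = _mm_and_si128(v, mask);                             // SSE work on the low 128 bits
    _mm_storeu_si128( (__m128i*)blocks.data(), v );

    boost::from_block_range(blocks.begin(), blocks.end(), bits);  // copy them back
}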