You can just use SIMD on the whole bitset
object, if you know the object layout of your standard library.
Most implementations of std::bitset<>
make the obvious implementation choice that the object-representation of the whole bitset
object is just the bits, packed into contiguous bytes. (I'd be surprised if any mainstream real-world implementation wasn't like that, but there's no guarantee you can safely assume that.) Most of those choose to use an array of some integer type wider than a byte.
If we're talking about just the x86 compilers that implement Intel's intrinsics API, that's a smaller set of implementations.
In GNU libstdc++ at least, the lowest-address chunk holds bits 0..63, and so on. (So it's little-endian across chunks, and x86 is little-endian for the bytes within chunks.) And bitset[0] is the low bit of the low word, i.e. reading it can compile to a load and and eax, 1.
It's possible that implementations might make different choices, like storing bitset[0] at the bottom of the highest-address chunk, big-endian style. That wouldn't line up with how x86 bt / bts bitstring instructions index memory, but those are slow anyway, so the main reason to avoid such a layout is that it would take more work to turn a runtime-variable index into an address and a bitmask or shift count.
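For reference, here's a tiny sketch of that index math for the little-endian-chunk layout (assuming 64-bit chunks; test_bit is just an illustrative helper, not part of any library):

#include <cstddef>
#include <cstdint>

// With the little-endian-across-chunks layout, bit i lives in chunk i/64 at
// position i%64, so bitset[0] is the lowest bit of the lowest-address chunk.
inline bool test_bit(const std::uint64_t *chunks, std::size_t i)
{
    return (chunks[i / 64] >> (i % 64)) & 1;
}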
If you want to try to non-portably take advantage of this, use _mm_loadu_si128
on the std::bitset
object itself, not on a bit-iterator that &bitset[0]
returns.
#include <bitset>
#include <immintrin.h>
// being a struct or class member isn't necessary, just a handy place to put an alignas()
// for example purposes.
struct foo {
alignas(32) std::bitset<384> bs; // 32-byte aligned, 48 bytes total.
// alignas(16) would be sufficient for what I'm doing with SSE2
int x, y; // with or without these, the struct size is a multiple of the alignment, thus 64B.
};
// beware that allocating this with new might not respect alignment before C++17
void bar(foo *pfoo)
{
char *bsp = (char*) &(pfoo->bs); // pointer to (the first byte of) the bitset
// as a char* so pointer math works in bytes.
// unfortunately load/store intrinsics require casting back to __m128i*
// until AVX-512 when Intel realized void* would be better.
__m128i v0 = _mm_load_si128( (__m128i*)bsp ); // aligned load of bits 0..127
__m128i v1 = _mm_loadu_si128( (__m128i*)(bsp + 3) ); // unaligned load of bits 24..151
v0 = _mm_and_si128(v0, v1);
_mm_store_si128( (__m128i*)(bsp + 16), v0 ); // aligned store at another alignment boundary
}
This compiles (with GCC11.2 on Godbolt) to the following asm:
bar(foo*):
movdqu xmm0, XMMWORD PTR [rdi+3] # unaligned load has to use movdqu
pand xmm0, XMMWORD PTR [rdi] # aligned load can fold into a memory operand even without AVX
movaps XMMWORD PTR [rdi+16], xmm0 # aligned store. (movaps is shorter than movdqa)
ret
With AVX, the compiler could have chosen to do a vmovdqa
load for v0
and use an unaligned memory source operand for vpand xmm0, xmm0, [rdi+3]
, but I compiled without -march=haswell
to demo the SSE advantage of being able to use aligned load intrinsics. (See also Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? re: tuning options in older GCC.)
You can even alignas(32) std::bitset<256> bs
to align that instance of the bitset by 32 bytes, allowing use of aligned load/store like _mm256_load_si256
instead of loadu
. There could still be other objects in part of the last 32 bytes if your bitset's size isn't a multiple of 256 bits, so don't assume it's just alignment padding you can step on. It wouldn't be thread-safe to do a non-atomic load/store of those bytes (e.g. if you're modifying the bits that are part of the bitset and storing back the later bytes unchanged).
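For example, a minimal sketch of that (assuming AVX2 is available, e.g. compiling with -mavx2 or -march=haswell, and the same layout as above; the names a, b, and and_into_a are just for this example):

#include <bitset>
#include <immintrin.h>

alignas(32) std::bitset<256> a, b;   // each object exactly fills one 32-byte aligned chunk

void and_into_a()
{
    __m256i va = _mm256_load_si256( (__m256i*) &a );   // aligned 256-bit loads
    __m256i vb = _mm256_load_si256( (__m256i*) &b );
    _mm256_store_si256( (__m256i*) &a, _mm256_and_si256(va, vb) );   // a &= b
}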
Beware that allocating objects with more alignment than alignof(max_align_t)
(typically 16 in x86-64 implementations) is only well-supported with new
in C++17. Before that, alignas()
only Just Worked for static and automatic storage.
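For example, a minimal sketch reusing the foo and bar names from the example above:

void heap_example()
{
    // C++17: operator new for over-aligned types takes std::align_val_t, so a plain
    // new respects the alignas(32) on foo::bs.  Before C++17 you'd need something like
    // posix_memalign or _mm_malloc plus placement-new to get a 32-byte-aligned heap object.
    foo *pfoo = new foo();
    bar(pfoo);       // aligned loads/stores are safe on this heap object, too
    delete pfoo;
}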
Reminder: nothing guarantees this is portable
But it will probably work on a C++ implementation that isn't a DeathStation 9000.
If you can't / don't want to hand-roll your own bitmap, or don't want to use Alex's suggestion of itsy_bitsy which has a documented way to get at the data, then this hack might be worth it if you can't get your compiler to make efficient asm in a more portable way.
As long as your C++ library implements bitset with something like class bitset { private: _chunk_t _data[size]; }, there's nothing undefined about messing with the object-representation via intrinsics. (GNU libstdc++ uses _WordT _M_w[_Nw];)
The intrinsic vector types like __m128i are defined to safely alias any other data, just like char*. GCC/clang implement this by defining them as may_alias
types. See Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
(This does bypass the normal public / private restrictions, though.)
If this somehow breaks with some future compiler version, that's your problem. I think it's unlikely that something would change normal std::bitset
implementations to not have their object representation just be the array of bits, though.
You can look at the asm for something like return pfoo->bs.to_ulong()
and see what it loads to check for set high bits (unfortunately not vectorizing the test), before loading the low chunk. That confirms the bits are where we expected. (See the Godbolt link).
If you do this, write a unit test that uses _mm_set_epi32(1,0,0,0)
or something and store that to the bitset, then make sure the one set bit is where you expect it to be, at bs[96]
. That way you'll detect if the implementation changes the layout of std::bitset<>
.
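A minimal sketch of such a test (assuming the layout described above; check_bitset_layout is just an illustrative name):

#include <bitset>
#include <cassert>
#include <immintrin.h>

void check_bitset_layout()
{
    std::bitset<384> bs;                      // default construction zeroes all bits
    __m128i v = _mm_set_epi32(1, 0, 0, 0);    // only bit 96 of the 128-bit vector is set
    _mm_storeu_si128( (__m128i*) &bs, v );    // overwrite the low 16 bytes of the object
    assert(bs.count() == 1 && bs[96]);        // fails if the layout isn't what we assumed
}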
You could also use a static_assert
on the size. For a size like 256 bits, sizeof()
will be a constant 32 even across implementations that use char bits[32]
or uint64_t bigchunks[4]
. sizeof(std::bitset<129>)
could vary, though. But static_assert
won't catch differences in the order of the words or bits within a word.
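For example, a minimal check along those lines:

#include <bitset>

// Catches an implementation whose object size differs (e.g. extra members), but not a
// different chunk order or bit order within a chunk.
static_assert(sizeof(std::bitset<256>) == 32, "unexpected std::bitset<256> object size");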
If you can use C++20, then the unit test for bit order can also be put in static_assert
, as the bitset operations needed here (construction and the const operator[]) are constexpr
, and there's std::bit_cast
that can be used at compile time. Though in this case the unit test wouldn't be able to use SSE intrinsics, and would have to use plain C++ operations. You could use char*
operations to manipulate the object-representation of a std::bitset
the same way you would with intrinsics, though. Or even better, use std::bit_cast<>
which shouldn't compile for types with a vtable or something, at least in a constexpr context. For example, Alex suggested https://godbolt.org/z/1advToGf5 in comments.
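A minimal sketch of what such a compile-time check could look like (in the spirit of that link, not copied from it; it assumes std::bitset<256> is 32 bytes, trivially copyable, and laid out as described above):

#include <array>
#include <bit>
#include <bitset>
#include <cstdint>

// Only bit 96 set: chunk 1 (bits 64..127), bit 32 within that chunk.
constexpr std::array<std::uint64_t, 4> words{ 0, std::uint64_t{1} << 32, 0, 0 };

// std::bit_cast and bitset's const operator[] are both usable in constant expressions,
// so this fails to compile if the chunk order / bit order isn't what we assumed.
static_assert(std::bit_cast<std::bitset<256>>(words)[96],
              "std::bitset layout differs from the assumed one");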
The very fact that std::bitset
operations will be constexpr
in C++20 probably rules out some insane implementation choices entirely.