Efficient bit-mapping: one-time computation?

Question

I have an array of up to 32 rows (often less) of uint32_t values that are essentially just bitmaps - so a total of 1024 bits maximum. The data in these <=32 uint32_ts changes frequently (multiple times/millisecond).

For a vectorized algorithm I'm running, I need to extract some of the data from these bitmaps and rearrange the bits into the format my algorithm expects. The positions of the data I need to extract is known at runtime and will not change during the run. I want to find a way - even if it's a bit computationally heavy, as long as it's not memory-heavy I can frontload it - to compute some mapping function that will enable me to quickly extract the data into the format my algorithm expects, so I can run it as frequently as possible to keep up with the data.

The mapping looks something like this, presented in uint8_t binary format below for simplicity.

Input data (nominally up to 32 uint32_ts, presented here as 3 uint8_ts):

abcd efgh
ijkl mnop
qrst uvwx

I want to extract some of these, and create something like the following output data:

00000adu
00000fgx

Here, I'm only extracting 6 of the 24 bits. As a note, the data I extract will always:

Be an even number of bits - I need pairs, for instance "af" and "dg" and "ux" above.
Each "pair" will always come from either the same row (uint32_t) or column (bit position within all 32 of the uint32_ts). I will never have some data come from the same row and other data come from the same column; it's one or the other for a given configuration.
The algorithm expects to operate on a pair of output uint32_ts. In the case where I extract more than 64 bits - and so the result won't fit in 2 uint32_ts - I'll have to expand into a pair of arrays of uint32_ts.

Obviously there's the naive approach, but it doesn't operate very quickly - I'm hoping to run this as many times a second as possible.

#define EXTRACT_PAIRS_COUNT 3
const uint8_t colExtractLeft[EXTRACT_PAIRS_COUNT] = {7, 4, 3}
const uint8_t colExtractRight[EXTRACT_PAIRS_COUNT] = {2, 1, 0}
const uint8_t rowExtract[EXTRACT_PAIRS_COUNT] = {0, 0, 2}

uint8_t input[3] = {abcdefgh, ijklmnop, qrstuvwx};

while (inputChanges) {
    uint8_t outputLeft = 0;
    uint8_t outputRight = 0;
    for (size_t i=0; i<EXTRACT_PAIRS_COUNT; i++) {
        outputLeft |= ((input[rowExtract[i]] >> colExtractLeft[i]) & 0x1) << i;
        outputRight |= ((input[rowExtract[i]] >> colExtractRight[i]) & 0x1) << i;
    }
    runAlgorithm(outputLeft, outputRight);
}

(Note that the #defines and const variables at the top of that example are just for demonstration/clarity, in reality they're calculated at runtime)

How can I efficnetly map bits into this format?

I'm in a C++ environment, but I'm definitely more familiar with C. I guess I'll go with C++ as it's a superset, right? — Helpful, Mar 16 '23 at 00:11
No, C++ is not a superset of C. There are C programs that cannot be compiled as C++ programs, and C programs that syntactically can be compiled as C++ programs but which do different things when treated as C++. — Brian61354270, Mar 16 '23 at 00:12
Well, there I go exposing my ignorance. Regardless, I am in a C++ environment - so I may use more c-esque idioms because that's what I'm used to, but I don't have to stick to just what's possible in C for this problem. — Helpful, Mar 16 '23 at 00:12
Can you use advanced bit manipulation primitives such as [_pext_u64](https://www.felixcloutier.com/x86/pext) — harold, Mar 16 '23 at 00:13
SSE/AVX added a lot of instructions such as the PEXT processor instruction added in BMI2 — user253751, Mar 16 '23 at 00:13
@harold Unlikely that they're implemented on the microcontrollers this is targeting, but if the compiler will polyfill (now I'm throwing in web terms - I hope it makes sense here... jack of many trades and master of none) and it's still performant that's probably fine. — Helpful, Mar 16 '23 at 00:14
@Helpful OK there's no point then, PEXT cannot be efficiently polyfilled as far as I know, the only way for it be efficient is if it's implemented in hardware. Does that microcontroller offer anything of use though? — harold, Mar 16 '23 at 00:16
There are a few different microcontroller targets, so while each one of them might have something unique and helpful, the overlap is likely nonexistent. — Helpful, Mar 16 '23 at 00:17
What about efficient multiplications, do you at least have those? (for tricks [such as this](https://stackoverflow.com/q/14547087/555045)) — harold, Mar 16 '23 at 00:19
That's liable to be present on a few, but definitely not on all. — Helpful, Mar 16 '23 at 00:25

score 1 · Accepted Answer · answered Mar 16 '23 at 09:38

Since you are working on a fixed number of bits, you can try std::bitset<N> which allocates N bits on automatic storage; since the size is a constexpr(compile-time constant) this class doesn't use dynamic allocation. std::vector<bool> on the other hand, allocates packed bits dynamically with runtime size and resizable storage(I guess you don't need this). The problem with std::bitset is that minimum size (sizeof) and alignment (alignof) are underspecified and platform dependent. On a 32 bit machine you might write:

alignas(std::uint32_t)
std::array<std::bitset<32>,32> bitmap;
bitmap[5][7]=true;
static_assert(sizeof(bitmap)==sizeof(std::uint32_t[32]));//on 32bit platform

But due to underspecification, on a 64 bit platform this might waste another 1024 bits:

static_assert(sizeof(bitmap)==sizeof(std::uint64_t[32]));//on 64bit platform

Another approach can be to define your own class in terms of std::bitset:

struct bitmap{
     std::bitset<1024> value;

     using index_pair=std::pair<std::uint32_t,std::uint32_t>;

     auto operator [](index_pair ij)
     {return value[(ij.first%32)*32+ij.second%32];};

     auto operator [](index_pair ij) const
     {return value[(ij.first%32)*32+ij.second%32];};
};

bitmap bm;
bm[{5u, 7u}]=true;

std::bitset has a tone of accessor functions. Thus, the second approach may need to redefine many of them. With C++23 it is possible to define and use the indexer much simpler:

    auto bitmap::operator [](this auto&& self, uint32_t i, uint32_t j);
   
bm[5u, 7u]=true;

Which is great syntax improvement for readability.

I'm gunna go ahead and give the accepted answer here; thanks for the input! I hadn't ever heard of the std::bitset, so that gives me a direction to go at least. Not much activity on the question, so thanks for the help. — Helpful, Mar 17 '23 at 04:27
Is there any reason to think that this will be more efficient than the original code? — harold, Mar 17 '23 at 04:34
@Helpful, `std::biset` has been around since C++98. But strangely enough, it has mostly been neglected. There are critics about some aspects of this type, but I believe it is a really handy tool. — Red.Wave, Mar 17 '23 at 15:37
@harold, the OP is not concerned about efficiency right now. The main advantage of this approach is immunity to common bit manipulation pitfalls. The interface of this std class is also much more readable than most manually defined classes. — Red.Wave, Mar 17 '23 at 15:42

Efficient bit-mapping: one-time computation?

1 Answers1