4

What is the most efficient and most elegant way to interpret a string of bytes in modern C++? My first naive attempt was to use a bit field. Here is an example that hopefully explains the purpose and the difficulty of the task:

#include <cstdint>
#include <iostream>

union Data {
    uint8_t raw[2];
    struct __attribute__((packed)) {
        unsigned field1: 4, field2: 2, field3: 1, field4: 2;
        unsigned field5: 7;
    } interpreted;
};


int main() {
    static_assert(sizeof(Data) == 2);
    Data d{.raw{0x84, 0x01}};
    std::cout << d.interpreted.field1 << std::endl;
    std::cout << d.interpreted.field4 << std::endl;
    std::cout << d.interpreted.field5 << std::endl;
}

This approach is computationally efficient, but it is not portable, and the order of the fields in memory is difficult to predict.

Output on i386/gcc11:

4
3
0

The 4 from 0x84 ended up in field1, while field5 starts just above the least significant bit of 0x01 (that bit belongs to field4). Is there a better way? Perhaps a solution that sacrifices some processing efficiency for maintainability and portability?

timrau
user23952
  • 3
    Just keep it simple and assign to each member in turn. – Jesper Juhl Apr 28 '23 at 15:51
  • 1
    What are the actual requirements here? You already present an answer to the title question, but then criticize it based on lack of object-representation portability. You will have such a portability issue with *any* approach that does what you have actually asked. So, is it really about interpreting the byte sequence, or is it really about mapping a struct to it? – John Bollinger Apr 28 '23 at 15:58
  • 1
    Your code has undefined behavior. C++ doesn't allow type punning through a union except if all types are standard layout classes and they all share a common initial sequence of members. – NathanOliver Apr 28 '23 at 15:58
  • 2
    The obvious approach for portability would be to not rely on any object representations at all and instead extract each value formed by a set of bits properly from the value of your byte array via arithmetic operators. Whether you then store them in individual variables or a class with bit fields is secondary. Anything relying on object representations can't be portable as mentioned in a previous comment. (And to avoid the type-punning UB there is `std::start_lifetime_as` since C++23.) – user17732522 Apr 28 '23 at 16:03
  • @NathanOliver thanks for clarifying why this code has undefined behavior. It may not be a problem if the platform requirements are known and checked, for example by `static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__); static_assert(__GNUC__ >= 11, "GCC version 11 or later is required.");` – user23952 Apr 28 '23 at 16:08
  • Indeed, per previous comments, relying on a specific object representation is problematic. The other options are apparently not as computationally efficient, but this is a secondary concern in my particular use case. – user23952 Apr 28 '23 at 16:13
  • This is basic serialization. – Mikel F Apr 28 '23 at 18:07
  • The code above makes sense if both the `raw` field and the uint bit-fields are interpreted as little-endian values. This way, [0x84, 0x01] is written in memory as [0010 0001, 1000 0000] (least significant bit first). The bit-field members are laid out sequentially; `field4` falls on the byte boundary and gets the bits [... 1, 1 ...], that is [11]. This memory area is interpreted as a uint in little-endian order (not changing the value in this case), which is 0b11=3. – user23952 Apr 28 '23 at 18:55
  • 1
    The most significant problem is bit-field representation portability. If bit ordering were not a problem, bit-fields would be good tools for avoiding logic errors due to bit manipulation. Some efforts have been made to provide alternatives, but there is no common practice I know of, and the existing ones have readability issues. Type punning is the next challenge; there are quite a few solutions for it (including ``, or `union` with `char` dialects); but the first problem is dominant here. – Red.Wave Apr 28 '23 at 19:20

1 Answer

3

One problem is that type punning through a union is UB in C++, though some compilers allow it as an extension. Another problem is that the way bit-fields are laid out is not UB but is implementation-defined. That said, most compilers allocate bit-fields starting from the low-order bits and allow a field to span a byte boundary. The standard does not guarantee this, but the compiler's documentation should specify it.

One way to do this safely and efficiently is a separate function that returns a Data object using std::bit_cast, together with a test, run once at runtime, that checks the implementation's layout and fails (perhaps by throwing an exception) if it is not the expected one.

#include <cstdint>
#include <iostream>
#include <bit>

// 0000000'11'0'00'0100 { 0x84, 0x01 };
struct Data {
    uint16_t field1 : 4, field2 : 2, field3 : 1, field4 : 2;
    uint16_t field5 : 7;
};

Data to_Data(uint8_t(&a)[2]) {
    return std::bit_cast<Data>(a);
}

// returns true if the implementation's bit-field layout is the expected one
// fails to compile if sizeof(Data) != 2 (std::bit_cast requires equal sizes)
bool test_Data_implementation()
{
    uint8_t a[2]{ 0x84, 0x01 };
    Data d = std::bit_cast<Data>(a);
    return d.field1 == 4 && d.field4 == 3 && d.field5 == 0;
}

int main() {
    if (test_Data_implementation())
        std::cout << "Implementation passes\n";
    else
        std::cout << "Implementation fails\n";
    uint8_t a[2]{ 0x84, 0x01 };
    Data d = to_Data(a);
    std::cout << d.field1 << std::endl;
    std::cout << d.field4 << std::endl;
    std::cout << d.field5 << std::endl;
    //4
    //3
    //0
}

I also made a constexpr, immediately invoked lambda that adds no runtime code: it checks at compile time whether bit-fields are packed as expected, since this, while very common, is implementation-defined. The advantage, aside from being a compile-time check, is that it doesn't add anything to the global (or local) namespace. Adding this to any function that is compiled will check the compiler's bit-field implementation and little-endian state. I actually did this because it wound up simplifying some decoding of ICC (International Color Consortium) profile structures, which are defined as binary objects.

[]() {
    constexpr uint16_t i = 0b0000'0001'0000'1101;
    struct A {uint16_t i0 : 2, i1 : 3, i2 : 4, i3 : 7; };
    constexpr A a{ std::bit_cast<A>(i) };
    static_assert(a.i0 == 1 && a.i1 == 3 && a.i2 == 8 && a.i3 == 0);
}();

Quick note: Clang hasn't yet implemented constexpr bit_cast for bit-fields. It's an outstanding bug. MSVC and GCC have. For those using MSVC, IntelliSense, which does not use the MSVC compiler front end, puts red squiggles under some of the code, but it still compiles just fine with MSVC.

doug
  • Thanks for sharing, I didn't know about `std::bit_cast`, and it looks like a more deliberate way to get the data into the Data d struct. I compiled the code and it works. Also, bit_cast checks that the input type and the output type have the same size, so, as you point out, an explicit check is no longer necessary. I work with [GCC, and it supports type punning through union](https://gcc.gnu.org/onlinedocs/gcc-13.1.0/gcc/Structures-unions-enumerations-and-bit-fields-implementation.html), but your code will work with all compliant compilers. – user23952 Apr 29 '23 at 15:45