0

I have some binary data that contains structures that are defined as follows:

s1:
a - 1B
b - 4B
c - 2B
d - 8B

s2:
a - 1B
b - 3B
c - 2B
d - 6B

It uses Big Endian byte order.

I parse s1 with the following code:

#include <endian.h>
#include <stdint.h>
#include <string.h>

struct [[gnu::packed]] s1 {
    uint8_t  a;
    uint32_t b;
    uint16_t c;
    uint64_t d;
};

void parse_s1(struct s1 *parsed, const unsigned char buf[static restrict sizeof(*parsed)])
{
    memcpy(parsed, buf, sizeof(*parsed));
    parsed->b = be32toh(parsed->b);
    parsed->c = be16toh(parsed->c);
    parsed->d = be64toh(parsed->d);
}

Which AFAIK, is defined behavior.

And for the second structure, I'm considering the following code, but am not sure if I may be hitting Undefined Behavior at some point (I was lucky that the fields are byte-aligned (i.e., no 3-bit field) and that the non-power-of-two byte fields are not consecutive, but a complete answer might want to consider what would happen if two non-power-of-two byte integers are consecutive (i.e., if d went before c)):

#include <endian.h>
#include <stdint.h>
#include <string.h>

struct [[gnu::packed]] s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

void parse_s2(struct s2 *parsed, const unsigned char buf[static restrict sizeof(*parsed)])
{
    memcpy(parsed, buf, sizeof(*parsed));
    parsed->b = be32toh(parsed->b) >> 8;
    parsed->c = be16toh(parsed->c);
    parsed->d = be64toh(parsed->d) >> 16;
}

I also assume this code is C++-compatible (or more specifically gnu++-compatible) (except for the actual prototype of the function, which in C++ would use __restrict__ and a pointer instead of a VLA).

Is the code above correct in both C and C++ (GNU dialects)? Or does it rely on Undefined Behavior?


EDIT:

The following is an answer to @Anaconda's comment, as this doesn't format well in a comment:

//bitfields.c

#include <stdint.h>
#include <stdio.h>

struct [[gnu::packed]] s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

int main(void)
{
    printf("%zu\n", sizeof(struct s2));
    return 0;
}

$ cc -Wall -Wextra -pedantic -std=c2x bitfields.c

$ ./a.out
12

It seems (experimentally) that GCC understands [[gnu::packed]] so that it compresses bitfields as much as it can (at least byte-wise; I don't deal with integer widths that aren't multiple of 8, so I'm fine with this).

I'm not sure if there's any UB here, but I guess I'd have to be really unlucky, and GCC really evil, to break the code above.

  • I don't know for mixed structures with some "normal" members and some bit-field members, but IIRC for pure bit-field structures the compiler is allowed to rearrange the members and change their order. – Some programmer dude Jul 26 '21 at 11:38
  • 3
    `beXtoh()` should be a no-op on big-endian. So `beXtoh()>>Y` looks wrong for a potential BE host. –  Jul 26 '21 at 11:42
  • 1
    @Someprogrammerdude Really? I believe to recall that neighbouring fields within need to remain neighbouring, compiler is only free to decide at which end of the underlying type it starts with (which still makes bit fields non-portable...). – Aconcagua Jul 26 '21 at 11:44
  • @Aconcagua That sounds more reasonable, I could easily have misunderstood. :) – Some programmer dude Jul 26 '21 at 11:45
  • @dratenik, Good catch! I'll write a wrapper be24toh() with the help of some preprocessor directives from here: https://stackoverflow.com/a/2100363/6872717 – alx - recommends codidact Jul 26 '21 at 11:46
  • 3
    Even with `packed` attribute `a` and `b` shouldn't get compressed and `b` should still occupy a whole `uintmax_t` on its own. If you really want to pack bits without gaps you might prefer `uintmax_t a : 8; uintmax_t b: 24; /* ... */` – note all members getting bit fields... – Aconcagua Jul 26 '21 at 11:47
  • @Aconcagua Will the compiler respect that `a` goes before `b`? Or may it choose them to be nighbouring but the other way around (i.e., `d`, `c`, `b`, `a`)? – alx - recommends codidact Jul 26 '21 at 12:21
  • @Aconcagua Also, I guess that to have `sizeof` return the size of the packed bits with no extra padding at the end, I'll have to think really well the type of each bitfield (i.e., `uint32_t a : 8; uint32_t b : 24; uint64_t c : 16; uint64_t d : 48;`), right? Which in more complex cases might not be possible... – alx - recommends codidact Jul 26 '21 at 12:25
  • 1
    @alx The order in which the compiler arranges the individual bit fields within the underlying type is implementation defined, so for `uint8_t a : 4; uint8_t b : 4;` the compiler can decide to place a into the less significant nibble or b, but needs to document how it does. If you rely *portably* on a specific order within the underlying type (not considering the machine's endianness), then you need to pack the bits *manually* by bit-shifting and masking. – Aconcagua Jul 26 '21 at 13:35
  • `sizeof(a)` would return the size of the underlying type in bytes, i. e. with `CHAR_BIT == 8` it would return 4 in your latest example/comment. `sizeof(a) + sizeof(b)` would then result in 8! There isn't *any* equivalent way to get the number of bits back! You can get the size of the entire struct, but to sum up sizes of members you need to know where a bitfield occupies a new type and only sum up these first ones. Again you are better off with manual bit packing then. – Aconcagua Jul 26 '21 at 13:42
  • @Aconcagua, please see the edited question, which I think proves incorrect some of your statements (the compiler will pack `b` even if its neighbours aren't bitfields) – alx - recommends codidact Jul 26 '21 at 14:19
  • When using the @ syntax, please make use of the feature with the tabulator key, which offers you a selection of available user names. I think you misspelled. – Yunnosch Jul 26 '21 at 14:21
  • Oops, you're right I meant @Aconcagua. Fixed – alx - recommends codidact Jul 26 '21 at 14:23

2 Answers2

2

Is the code above correct in both C and C++ (GNU dialects)? Or does it rely on Undefined Behavior?

Well, it's UB according to the standard.

According to GNU, all these details are "Determined by ABI".

According to Agner Fog's summary of x86 calling conventions [pdf],

Objects of structures and classes are stored by placing the data members consecutively in memory. Unused bytes may be inserted between elements and after the last element, if needed, for the sake of alignment. The requirements for alignment are ...

There's no language permitting reordering, nor special language about bitfields. So, that should broadly cover GNU on x86. If you want GNU on a different platform, either say what it is or look up its ABI doc.


NB. other language features - specifically pointers and C++ references - may work badly or not at all with misaligned and oddly-sized data members.

It's sometimes more efficient to deserialize into a normally-aligned structure the slow-looking way, operate on correctly-aligned normally-sized types, and re-serialize the result if necessary.

Useless
  • 64,155
  • 6
  • 88
  • 132
  • Thanks! Yes, I'm on x86_64. I don't know if this code will ever run on aarch64, or any other architecture, but I doubt it, and I hope the ABI isn't much different in this aspect, anyway. About efficiency: I parse the raw data into the struct so that I can later read it once or twice and store it (with some modifications) in a normally-aligned data structure. So it provides the benefit of allowing `memcpy()` + `beXtoh()` which are really nice and simple APIs, and then I can already use the struct members in the decoding part so that it's much more readable. – alx - recommends codidact Jul 26 '21 at 14:58
  • After some days, I came up with a happy idea, so I now have good looking code that doesn't use packed structures. It can be done nicely with `memcpy()` + `beXtoh()` element by element with some wrappers. (Non-standard functions like be24toh() are still needed.) – alx - recommends codidact Jul 29 '21 at 06:34
0

An experimental answer on x86_64 (just compile the same code on other architectures to test them):

// bitfields.c:

#define _DEFAULT_SOURCE
#include <endian.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct [[gnu::packed]] s2 {
    uint8_t   a;
    uintmax_t b : 24;
    uint16_t  c;
    uintmax_t d : 48;
};

uint32_t be24toh(uint32_t be)
{
#if BYTE_ORDER == BIG_ENDIAN
    return be;
#elif BYTE_ORDER == LITTLE_ENDIAN
    return be32toh(be) >> 8;
#else
    #error "wtf"
#endif
}

uint64_t be48toh(uint64_t be)
{
#if BYTE_ORDER == BIG_ENDIAN
    return be;
#elif BYTE_ORDER == LITTLE_ENDIAN
    return be64toh(be) >> 16;
#else
    #error "wtf"
#endif
}

void parse_s2(struct s2 *s, const unsigned char *raw)
{
    memcpy(s, raw, sizeof(*s));
    s->b = be24toh(s->b);
    s->c = be16toh(s->c);
    s->d = be48toh(s->d);
}

int main(void)
{
    struct s2 s;
    unsigned char raw[sizeof(s)] = {1,2,3,4,5,6,7,8,9,0xA,0xB,0xC};

    parse_s2(&s, raw);

    puts("sizeof:");
    printf("s:   %zu\n", sizeof(s));
    printf("s.a: %zu\n", sizeof(s.a));
    printf("s.c: %zu\n", sizeof(s.c));

    puts("offsetof:");
    printf("s.a: %zu\n", offsetof(struct s2, a));
    printf("s.c: %zu\n", offsetof(struct s2, c));

    puts("contents:");
    printf("s.a: %#.2jx\n", (uintmax_t) s.a);
    printf("s.b: %#.6jx\n", (uintmax_t) s.b);
    printf("s.c: %#.4jx\n", (uintmax_t) s.c);
    printf("s.d: %#.12jx\n", (uintmax_t) s.d);
    return 0;
}

Results:

$ cc -Wall -Wextra -pedantic -std=c2x bitfields.c 
$ ./a.out 
sizeof:
s:   12
s.a: 1
s.c: 2
offsetof:
s.a: 0
s.c: 4
contents:
s.a: 0x01
s.b: 0x020304
s.c: 0x0506
s.d: 0x0708090a0b0c
$ c++ -Wall -Wextra -pedantic -std=c++20 bitfields.c 
$ ./a.out 
sizeof:
s:   12
s.a: 1
s.c: 2
offsetof:
s.a: 0
s.c: 4
contents:
s.a: 0x01
s.b: 0x020304
s.c: 0x0506
s.d: 0x0708090a0b0c

It works as expected on GCC 11 (x86_64) and doesn't trigger any warnings.