How safe is casting a struct to uint8_t * or char * and accessing it via bytestream?

Question

The following logic works fine, but I'm unsure what caveats the standard imposes and whether it's entirely safe to cast a pointer to a struct to uint8_t * or char * in order to pass it to a message queue (which itself takes a pointer to a buffer) or to a function.

My understanding is that as long as uint8_t is a byte type (which char is), it can be used to address any sequence of bytes.

typedef struct
{
    uint8_t a;
    uint8_t b;
    uint16_t c;
} __attribute__((packed)) Pkt;


int main() 
{
    Pkt pkt = {.a = 4, .b = 12, .c = 300};
    mq_send(mq, (char *) &pkt, sizeof(pkt), 0);
}

Perhaps it's similar to passing a cast pointer to a function (on the receiving end) that parses the data byte by byte:

typedef struct
{
    uint8_t a;
    uint8_t b;
    uint16_t c;
}  __attribute__((packed)) Pkt;

void foo(uint8_t *ptr)
{
    uint8_t a = *ptr++;
    uint8_t b = *ptr++;
    uint16_t str =  (*(ptr+1) << 8) | *ptr;
    printf ("A: %d, B: %d, C: %d\n", a, b, str);
}

int main() 
{
    Pkt pkt = {.a = 4, .b = 12, .c = 300};   
    foo((uint8_t *) &pkt);
}

"safe to cast a struct to uint8_t * or char *" --> yes. always. (unless `char` is _signed_ non-2's complement - unheard of these days.) — chux - Reinstate Monica, Oct 31 '21 at 01:54
Yes, but `char` is always a _byte_ too, even when it is not 8-bit. In C, a _byte_ is not always 8-bit, yet that is uncommon today. — chux - Reinstate Monica, Oct 31 '21 at 01:55
if `char` isn't always 8-bit, wouldn't it cause issues with accessing members that are of `uint8_t`? — xyf, Oct 31 '21 at 01:57
Yet the whole goal is dubious, why `mq_send(mq, (char *) &pkt, sizeof(pkt), 0);` vs `mq_send(mq, (void *) &pkt, sizeof(pkt), 0);`? — chux - Reinstate Monica, Oct 31 '21 at 01:57
When `char` is not 8-bit, the _optional_ type `uint8_t` does not exist, so `uint8_t` members do not exist. — chux - Reinstate Monica, Oct 31 '21 at 01:57
okay that's fine. So casting a struct to `uint8_t *` or `char *` is totally safe since they represent the smallest addressable byte which could be used to identify the data accordingly? — xyf, Oct 31 '21 at 02:01
Two comments: (1) Packed struct is not covered by the standard, thus the standard cannot cover that part. (2) The function `foo` assumes that the `uint16_t` is stored little-endian. This is not necessarily the case (i.e. not specified by the standard). — nielsen, Oct 31 '21 at 02:01
To be very pedantic I think `[unsigned] char` is safer. `char` and `unsigned char` have a special exemption from the [strict aliasing rule](https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule) which technically `uint8_t` does not, even if they're the same size. — Nate Eldredge, Oct 31 '21 at 02:40
@xyf - Yes. "casting a struct to uint8_t * or char *" for all pointer types, except function pointers. — chux - Reinstate Monica, Oct 31 '21 at 05:20
@nielsen "The function foo assumes that the uint16_t is stored little-endian." --> Yes and no. With either endian, the function "works", just differently. — chux - Reinstate Monica, Oct 31 '21 at 05:22
@NateEldredge Hmmm, I disagree - `uint8_t` would also have same AA characteristics as `unsigned char`, yet I'll leave deeper discussion for another question. — chux - Reinstate Monica, Oct 31 '21 at 05:31
@chux-ReinstateMonica: I think the point is that the exception in the strict aliasing rule is for "character types", which are defined as `char, signed char, unsigned char` and nothing else. Most common implementations probably do `typedef unsigned char uint8_t;` so that `uint8_t` also gets the benefit, but they don't necessarily have to. Your DeathStation 9000 could `typedef __internal_8_bit_type_that_is_not_a_character uint8_t;` and assume `uint8_t` can't alias other types. — Nate Eldredge, Oct 31 '21 at 05:55
@NateEldredge Disagree with "and nothing else." given "a type compatible with the effective type of the object," (C17dr § 6.5 7). Perhaps post a question? — chux - Reinstate Monica, Oct 31 '21 at 06:01
@chux-ReinstateMonica: C 2018 6.2.5 15 clearly defines the *character types* as `char`, `signed char`, and `unsigned char` and nothing else. `uint8_t` can be an alias for `unsigned char`, in which case it is `unsigned char` because it is simply a different name for the same thing. However, it can also be a *extended integer type*, as discussed in 6.2.5 4, 6, and 7. There is simply nothing in the C standard that definitely connects `uint8_t` to a type that may be used to alias any object. — Eric Postpischil, Oct 31 '21 at 08:41
@chux-ReinstateMonica About endianness in `foo()`: That is not correct. The value of the `uint16_t`, `str`, will be 256*[byte1]+byte0 no matter if the *receiver* is little or big endian, but this is not the correct value if the *sender* has stored the value in big endian. — nielsen, Oct 31 '21 at 08:48
@nielsen The code is correct that with either endian the function is well behaved - no UB. True that with different endians, different values result, yet `foo()` does not specify what `uint16_t str = (*(ptr+1) << 8) | *ptr;` should produce nor is sender code posted. OP's concern is "safe is casting a struct to uint8_t * or char *", not about functionality per endian. — chux - Reinstate Monica, Oct 31 '21 at 14:41
By the way, `(*(ptr+1) << 8) | *ptr;` may have undefined behavior. `*(ptr+1)` will be promoted to `int`. If that is 16 bits and the high bit (bit 7) of the byte is set, then `*(ptr+1) << 8` overflows `int`, and the behavior is not defined by the C standard. This can be fixed with `(uint16_t) ptr[1] << 8 | ptr[0]`. — Eric Postpischil, Oct 31 '21 at 15:00
@chux-ReinstateMonica: Yes, by "the exception" I mean only the last bullet of 6.5.7. A character type may freely alias every object, regardless of the object's effective type. No other type has that general privilege, and as Eric says, no type other than `char, signed char, unsigned char` is a character type. I feel pretty clear on this and don't feel the need to post a new question, but you are welcome to. I feel like I've seen it brought up here on SO before though. — Nate Eldredge, Oct 31 '21 at 17:20

score 1 · Answer 1 · answered Oct 31 '21 at 08:35

C deliberately allows accessing the bytes of an object and supports communicating objects by transmitting the bytes that represent them and reconstructing them from the transmitted bytes. However, it should be done correctly, and there are some issues to deal with.

A character type should be used.

The preferred type to work with is unsigned char. This is preferred for two reasons:

The C standard defines the behavior of using character types to access the representations of objects. The character types are char, signed char, and unsigned char. The standard does not require that uint8_t be a character type. Although it may have the same size and general properties of unsigned char, it may be an extended integer type rather than an alias of unsigned char (or of char). In this case, the C standard does not define the behavior of accessing the bytes of an object with uint8_t.
unsigned char is preferred over char or signed char to avoid problems with signed integers in various C operations.
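As a minimal sketch of the character-type access described above, the following reads an object's representation through `const unsigned char *`. The helper names `byte_sum` and `dump_bytes` are hypothetical and are not part of the question's code or of any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sum the bytes that make up any object. Reading the representation
   through unsigned char is defined by the C standard for every object type. */
unsigned byte_sum(const void *obj, size_t size)
{
    const unsigned char *bytes = obj;   /* void * converts implicitly */
    unsigned sum = 0;

    for (size_t i = 0; i < size; i++)
        sum += bytes[i];
    return sum;
}

/* Print the object representation as hexadecimal, one byte at a time,
   in memory order (so the output depends on the platform's byte order). */
void dump_bytes(const void *obj, size_t size)
{
    const unsigned char *bytes = obj;

    for (size_t i = 0; i < size; i++)
        printf("%02x ", (unsigned)bytes[i]);
    printf("\n");
}
```

With the question's structure, a call would look like `dump_bytes(&pkt, sizeof pkt);`. Taking `const void *` means the caller needs no cast at all, and the only pointer conversion happens inside the helper, to a character type.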

The sender and the receiver must agree on the representations of the objects or the protocol used for sending them.

If the sender and the receiver are compiled with the same C implementation using the same definitions for the objects being transmitted (such as the same structure definitions), they will agree on the representations. Between diverse C implementations, though, it is necessary to ensure there is clear agreement on how the transmitted bytes represent objects. As shown in your code, the structure is packed, which should take care of the problem that there may be padding inside structures. Other considerations include:

The order of bytes within integers. Little-endian (bytes in order from least significant to most significant) and big-endian (the reverse) are common, although other orders are possible. Big-endian is most common in network protocols.
Representations of non-integers, such as floating-point formats. The IEEE-754 floating-point standard specifies some interchange formats, which are very widely used.
Whether structures are laid out identically, including the types of their members.
Theoretically, the order of bits within bytes must be agreed, but this is not an issue if the network service is operating at the byte level.

Note that, of course, some objects are inherently impossible to send via their byte representations because they need context in the running program, such as pointers and file handles.
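One way to make the agreement explicit, independent of either side's structure layout or byte order, is to define the wire format byte by byte. The sketch below assumes a 4-byte wire format with the 16-bit field sent least significant byte first; that choice, and the names `PKT_WIRE_SIZE`, `pkt_encode`, and `pkt_decode`, are illustrative assumptions rather than anything specified in the question.

```c
#include <stdint.h>

enum { PKT_WIRE_SIZE = 4 };   /* assumed wire size: a, b, c low byte, c high byte */

typedef struct
{
    uint8_t a;
    uint8_t b;
    uint16_t c;
} Pkt;   /* no packing needed: the in-memory layout is never sent */

/* Write the packet into the assumed little-endian wire format. */
void pkt_encode(const Pkt *pkt, unsigned char out[PKT_WIRE_SIZE])
{
    out[0] = pkt->a;
    out[1] = pkt->b;
    out[2] = (unsigned char)(pkt->c & 0xFFu);          /* low byte first */
    out[3] = (unsigned char)((pkt->c >> 8) & 0xFFu);   /* then high byte */
}

/* Rebuild a packet from the wire bytes. The shift is done in unsigned
   arithmetic so it cannot overflow a 16-bit int. */
Pkt pkt_decode(const unsigned char in[PKT_WIRE_SIZE])
{
    Pkt pkt;

    pkt.a = in[0];
    pkt.b = in[1];
    pkt.c = (uint16_t)((unsigned)in[2] | ((unsigned)in[3] << 8));
    return pkt;
}
```

Because the values are assembled with shifts rather than by copying memory, the same code produces the same wire bytes on little-endian and big-endian hosts, and structure padding stops being a concern.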

Additional note

Another hazard to guard against is interpreting a byte buffer as another object. The C standard defines the behavior of accessing the bytes of an object (for example, something defined as a structure) using a character type, but it does not define the reverse. Sometimes naïve programmers will create an array of character type, read a network message into it, and then convert a pointer into the array to a pointer to a structure type. This runs afoul of two issues:

The conversion is not defined if the alignment is not correct. (This should not be a problem with a packed structure, which we would expect to have an alignment requirement of one byte.)
Accessing an array of characters as a different, incompatible type is not defined by the C standard.

The proper way to reassemble received bytes into an object is to copy them either into memory declared as the desired type or memory allocated (as with malloc) for the purpose of interpreting it as the intended object. This can be done by copying bytes from a buffer into the target memory or by directly passing the target memory to the network read routine, for it to fill in the bytes directly.
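A minimal sketch of the copying approach, assuming the sender and receiver share the same `Pkt` definition and byte order, and a compiler that supports the GCC/Clang `__attribute__((packed))` extension used in the question. The helper name `pkt_from_buffer` and its bounds-checking convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    uint8_t a;
    uint8_t b;
    uint16_t c;
} __attribute__((packed)) Pkt;   /* compiler extension, not standard C */

/* Copy a Pkt out of a receive buffer starting at `offset`.
   Returns 0 on success, or -1 if the buffer holds too few bytes.
   The destination is a real Pkt object, so reading its members
   afterwards raises no aliasing or alignment problem. */
int pkt_from_buffer(Pkt *out, const unsigned char *buf, size_t len, size_t offset)
{
    if (offset > len || len - offset < sizeof *out)
        return -1;

    memcpy(out, buf + offset, sizeof *out);
    return 0;
}
```

The destination can be an ordinary automatic or static `Pkt`, so no dynamic allocation is involved, and the source offset does not need to be aligned.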

are you saying to memcpy the bytes into a struct on the receiving end? do you have a sample code to demonstrate further? And ideally I'd wanna avoid dynamic allocation since this is for embedded system — xyf, Oct 31 '21 at 08:42
"The standard does not require that uint8_t be a character type." --> interesting. — chux - Reinstate Monica, Oct 31 '21 at 14:52
@xyf: To put bytes into a structure when receiving them from a network, simply pass the address of the structure to the network routine. E.g., `Pkt x; mq_receive(mq, &x, sizeof x, 0);`. If you have already received bytes into another buffer and need to interpret some part of the buffer as a `Pkt`, then you should do a `memcpy`, as with `Pkt x; memcpy(&x, source of bytes, sizeof x);`. — Eric Postpischil, Oct 31 '21 at 14:58
@xyf While the answer is correct in principle, in practice you just cast the `uint8_t *` to a `struct` pointer without `memcpy()`. The struct must be packed and the buffer allocated with proper alignment (I am quite sure the compiler will do the latter, so if you just make sure to use packed structs and handle byte order if necessary, then you should be fine). About byte order you may require that multibyte integers are stored "little endian" so as long as you are on a little-endian system you do not need to do anything. — nielsen, Oct 31 '21 at 15:04
@nielsen: That is a bad idea. Even if it appears to "work," it can break easily. Compiler optimizers have become increasingly aggressive, including regarding the aliasing rules of the C standard. During optimization, the compiler may treat an access to a structure as if it cannot alias an object defined as an array of `uint8_t` or of a character type. — Eric Postpischil, Oct 31 '21 at 15:08
Passing a pointer to a `struct` for a receive function only works if one `struct` covers all possible messages. This is often not the case. In practice, many embedded programmers are "deliberately naive" and just convert the byte buffer pointer into the correct `struct` pointer. — nielsen, Oct 31 '21 at 15:08
@EricPostpischil As I said, you are right in principle, but the performance penalty of copying all communication may be significant. I would personally prefer the pointer conversion unless I actually observe a problem on the given system. So far I have not experienced any problems in practice (as long as the structure is packed which should hint the compiler not to be too aggressive - with unpacked structs I have experienced problems with struct casting on an Arm system). — nielsen, Oct 31 '21 at 15:26
@nielsen: A good compiler will elide the `memcpy` during optimization. Failing that, a software engineer should seek a guarantee from their compiler that it supports this aliasing and documenting that they are using that guarantee. Simply advising people to write code without knowing that its behavior is properly defined is bad engineering, regardless of whether or not you have experienced problems in practice. Compilers change. — Eric Postpischil, Oct 31 '21 at 15:30
@EricPostpischil "To put bytes into a structure when receiving them from a network, simply pass the address of the structure to the network routine" - this applies given the receiver end is aware of `Pkt` struct (used the same way as the sender did) so you could `memcpy` the received bytes into `Pkt`. And if the receiver wouldn't bother about having `Pkt` allocated, it could parse the data via accessing bytes `(unsigned char *)` like I did in my example (except of `uint8_t*`, it's `unsigned char*`), yes? — xyf, Oct 31 '21 at 17:31
something like this in case the receiver is not aware of `Pkt` struct: https://cplayground.com/?p=gaur-cockroach-monarch — xyf, Oct 31 '21 at 17:37
