In my recent post about endianness, I was told that one should unionize the types because not doing so can cause UB.

I have found talk of this in other posts as well.

In this post, "strict pointer aliasing violation" is mentioned as a problem when not using unions.

In this post, it is mentioned that 8-bit types don't have to have the same memory alignment as 32 bit types, and that this affects the success of casting.
I don't understand this.

So I would like to know more about the scenarios in which UB can happen in the context of accessing 8 bit array members and casting them to 32 bit to generate a certain endianness via bitshifting, and how unionizing prevents it.

basedchad21
  • The links you posted already explain what the problem is: The strict aliasing rule says that accessing a `uint8_t` through a `uint32_t` pointer has undefined behavior. And the alignment issue also seems to be rather clear. If `uint32_t` has a stricter alignment requirement than `uint8_t`, the array might not be correctly aligned for `uint32_t`. What specifically do you not understand? – user17732522 Apr 15 '23 at 14:40
  • You shouldn't care about the endianness of your system. If you do, you are probably doing something wrong. – n. m. could be an AI Apr 15 '23 at 14:49
  • An 8-bit value, like a `char`, has to be accessible at any alignment, otherwise character arrays wouldn't work. 32-bit or 64-bit values are *allowed* to require higher alignment, because of the way the hardware is designed. So if you try to read an `int` from an odd address, some processors might refuse. – BoP Apr 15 '23 at 15:09
  • When storing different types in a union, the compiler will make sure that they are *all* aligned properly. So then it works. – BoP Apr 15 '23 at 15:12
  • @basedchad21, "Why is one supposed to unionize a uint8_t [4] array and uint32_t, when converting endianness?" is not always true as there are other ways that do not involve a `union`. _Why_ do you want to convert endianness? Note: _convert endianness_ is slightly different than _generating a certain endianness_. What is the real coding goal? – chux - Reinstate Monica Apr 15 '23 at 15:32
  • All that said, good effort on collecting and presenting the links you have read. That shows welcomed initial research before the question. Worth the nod for that alone. (hint: your question would be better formatted if you used the actual question title (or reasonable summary of it) in your links instead of just `"post"`) – David C. Rankin Apr 15 '23 at 16:13

1 Answer

In my recent post about endianness, I was told that one should unionize the types because not doing so can cause UB.

No. You were told that using a union is one way to do what you were after without invoking UB. It's not the only way.

"strict pointer aliasing violation" is mentioned as a problem when not using unions.

The usual terminology is "strict aliasing", not "strict pointer aliasing". Using a union appropriately is one way to avoid strict aliasing violations, but not the only way.

it is mentioned that 8-bit types don't have to have the same memory alignment as 32 bit types, and that this affects the success of casting.

Yes.

I don't understand this.

So I would like to know more about the scenarios in which UB can happen in the context of accessing 8 bit array members and casting them to 32 bit to generate a certain endianness via bitshifting, and how unionizing prevents it.

The first thing to understand is that in this context, "undefined behavior" means that the C language specification does not define the behavior. It does not mean that the program will necessarily behave differently on your machine than you expect (though it very well might do). It does not mean that the compiler must reject the program, or that the program must diagnose an error at runtime, or such -- all of those are possible, but requiring one or more would make that defined behavior.

C aims to be suitable for implementation on substantially all digital computers, including especially on bare metal -- it was, after all, conceived as a language for writing operating systems. Many of the items that the spec explicitly calls out as having undefined behavior are related to differences in the actual behavior of various machines, past, present, or hypothetical future.


The code in your previous post had this form:

  uint8_t buffer[4];

  // ... assign array element values ...

  switch (*((uint32_t *)buffer)) {
  // ...

The question touches on two main provisions of the language spec. I'll be quoting from C17, but all versions of the language spec to date have substantially equivalent versions of these provisions. Taking the casting question first, because it's straightforward, paragraph C17 6.3.2.3/7 is the main provision allowing conversion among object-pointer types. It says:

A pointer to an object type may be converted to a pointer to a different object type. *If the resulting pointer is not correctly aligned for the referenced type, the behavior is undefined.* Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

(Emphasis added.)

Thus, when people tell you that casting a uint8_t * to type uint32_t * might produce UB on account of mismatching alignment, that's coming directly from the language spec.
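To make the alignment requirement concrete, here is a minimal sketch that checks whether an address is suitably aligned for uint32_t. The helper name is_aligned_for_u32 is hypothetical (it is not from the original post), and the check assumes that converting a pointer to uintptr_t yields a plain numeric address, which is typical but implementation-defined.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reports whether p satisfies uint32_t's alignment
   requirement. Assumes the pointer-to-uintptr_t conversion produces the
   numeric address, as it does on mainstream flat-memory platforms. */
static bool is_aligned_for_u32(const void *p)
{
    return (uintptr_t)p % alignof(uint32_t) == 0;
}
```

A uint8_t array may legitimately start at an address for which this check fails, and that is exactly the case in which the conversion to uint32_t * has undefined behavior.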

The other main relevant portion of the language spec is the so-called "strict aliasing rule", which in C17 is paragraph 6.5/7:

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

  • a type compatible with the effective type of the object,
  • a qualified version of a type compatible with the effective type of the object,
  • a type that is the signed or unsigned type corresponding to the effective type of the object,
  • a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
  • an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
  • a character type.

Footnote 89 explains that

The intent of this list is to specify those circumstances in which an object may or may not be aliased.

And that's why even if your conversion (uint32_t *)buffer has well-defined behavior, you still invoke UB by attempting to access an object via the resulting pointer. The lvalue in question, *((uint32_t *)buffer), has type uint32_t, and that does not satisfy any of the alternatives given in the strict aliasing rule for accessing a uint8_t or an array of such. (Do not be confused by the term "effective type". Except where dynamic memory allocation is involved, you can understand "effective type" simply as "type".)

On the other hand, note that the strict aliasing rule explicitly allows access through a union. Furthermore, the union approach also solves the alignment issue, because a union's alignment requirement has to be at least as strict as that of its most strictly aligned member.
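As an illustration of the union approach, the following sketch stores four bytes through one member and reads them back through the other. The names word_bytes and bytes_to_word_union are hypothetical. C permits reading a union member other than the one last written, with the stored bytes reinterpreted in the new type, so the numeric result depends on the machine's byte order.

```c
#include <stdint.h>

/* Hypothetical type: both members share the same storage, and the union
   is aligned suitably for its strictest member (the uint32_t). */
union word_bytes {
    uint8_t  bytes[4];
    uint32_t word;
};

/* Copy four bytes into the union, then read them back as one 32-bit
   value. The value returned depends on the host byte order. */
static uint32_t bytes_to_word_union(const uint8_t src[4])
{
    union word_bytes u;
    for (int i = 0; i < 4; i++) {
        u.bytes[i] = src[i];
    }
    return u.word;
}
```

Given the bytes 0x11, 0x22, 0x33, 0x44, a little-endian machine would be expected to return 0x44332211 and a big-endian one 0x11223344.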


If you study that carefully, you may also see that there is at least one approach that works without involving a union: instead of declaring an array of uint8_t and accessing it via a uint32_t *, declare a uint32_t and access its bytes via a uint8_t *. This does assume that uint8_t is a character type, and the spec does not guarantee that, but in practice, if your implementation provides uint8_t at all, then it will be an alias for unsigned char or possibly for char.
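A minimal sketch of that reversed arrangement follows. It uses unsigned char rather than uint8_t for the byte pointer so that it relies only on the character-type alternative of the rule; the helper name word_from_bytes is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the object is declared as a uint32_t, and its
   bytes are written through an unsigned char pointer, which the
   character-type alternative of the strict aliasing rule permits. */
static uint32_t word_from_bytes(const unsigned char src[4])
{
    uint32_t word = 0;
    unsigned char *p = (unsigned char *)&word;
    for (size_t i = 0; i < sizeof word; i++) {
        p[i] = src[i];
    }
    return word;
}
```

There is no alignment problem in this direction, because a character type can be accessed at any address. The value produced still depends on the host byte order.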

Another way, which does not rely on uint8_t being a character type, would involve copying the contents of your uint8_t array to an object of type uint32_t via the memcpy() function.
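The memcpy() variant might look like the sketch below. The helper name word_from_buffer is hypothetical, and the code assumes the array holds exactly sizeof(uint32_t) bytes, as the four-element buffer in the question does.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the raw bytes of a 4-element uint8_t array
   into a uint32_t object. memcpy() has no alignment requirement on its
   source, and no lvalue of the wrong type is ever used to read the
   array, so neither provision quoted above is violated. */
static uint32_t word_from_buffer(const uint8_t buffer[4])
{
    uint32_t word;
    memcpy(&word, buffer, sizeof word);
    return word;
}
```

As with the other approaches, the value obtained reflects the host byte order. Optimizing compilers commonly reduce a small fixed-size memcpy() like this to a single load, though that is not guaranteed.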

John Bollinger
  • yes, I just got referred to this https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8 it says that basically only memcpy is "blessed" since "punning" through unions is considered hacky in C, and doesn't work in C++. But a commenter below mentions that unions are widely used in this fashion in the kernel and gcc. This is a real burden, learning something like this. Now I have to question every cast I ever made. – basedchad21 Apr 15 '23 at 16:38
  • I'm sorry to have to break the news to you, @basedchad21. Generally speaking, however, you should not be casting in C other than for arithmetic purposes. If you've been in the habit of committing strict-aliasing violations then yes, you should question all the pointer conversions involved (including those not performed via casts). Some compilers can help you identify (some of) those. For example, GCC's `-Wstrict-aliasing` option (which is also included in `-Wall`) addresses that. – John Bollinger Apr 15 '23 at 16:47