Portability of using union for conversion

Question

I want to represent a 32-bit number using RGBA values. Is it portable to generate the values for said number using a union? Consider this C code:

union pixel {
    uint32_t value;
    uint8_t RGBA[4];
};

This compiles fine, and I'd like to use it instead of a bunch of functions. But is it safe?

How do you plan to use the union, and what results do you hope to achieve? — Nate Eldredge, Apr 22 '21 at 03:54
If you assign a value such as 0x01234567 to `value`, the number in `RGBA[0]` will be different depending on whether the platform is big-endian (0x01) or little-endian (0x67). Therefore, it is not portable across platforms with different endianness. — Jonathan Leffler, Apr 22 '21 at 03:54
@JonathanLeffler is endianness the only issue? If so, it can always be accounted for. — QuestionLimitGoBrrrrr, Apr 22 '21 at 04:00
@QuestionLimitGoBrrrrr, I think it's also unspecified behavior--allowed only by a gcc extension, but I can't find the reference in the gcc manual at the moment. I added some more links to the bottom of my answer though for you to go dig down. — Gabriel Staples, Apr 22 '21 at 04:12
@GabrielStaples: In C, reading a union other than the last one written reinterprets the bytes in the new type; it is not unspecified or undefined. In C++, the behavior would be undefined. — Eric Postpischil, Apr 22 '21 at 07:27
@EricPostpischil, `In C, reading a union other than the last one written reinterprets the bytes in the new type`. Indeed, it does do this. I've done this in both C and C++ with the gcc compiler and saw no difference. But, that's due to a gcc extension, apparently, making it so. But, the C standard states it is "unspecified behavior." See the Annex J "Portability Issues" screenshot and quote for the C standard in my answer. — Gabriel Staples, Apr 22 '21 at 07:31
@GabrielStaples: Annex J is non-normative. In my other comment, I cited normative text. Additionally, the text you highlight is inapplicable. The bytes of a four-byte integer and a four-byte array overlap exactly; none are left unspecified in this case. — Eric Postpischil, Apr 22 '21 at 07:35
@EricPostpischil, can you help me find the standard? Do I need to go buy it? I don't have a copy of a final standard. Or, can you find the words in [the one I link to](https://web.archive.org/web/20181230041359/http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf)? — Gabriel Staples, Apr 22 '21 at 07:37
[This question](https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents) has information about locating the C standard. — Eric Postpischil, Apr 22 '21 at 07:40

score 6 · Accepted Answer · edited Apr 08 '22 at 22:06

Using Unions for "type punning" is fine in C, and fine in gcc's C++ as well (as a gcc [g++] extension). But, "type punning" via unions has hardware architecture endianness considerations.

Reading a union member other than the one last written is called "type punning". It is not directly portable, because the byte order you see depends on endianness, but it is otherwise fine in C. The C standards have NOT been great about making this clear, but apparently it is. Read these answers and sources:

Is type-punning through a union unspecified in C99, and has it become specified in C11?
Unions and type-punning
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type%2Dpunning - type punning is allowed in gcc C and C++

Additionally, the C18 draft, N2176 ISO/IEC 9899:2017 states in section "6.5.2.3 Structure and union members", the following in footnote 97:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called “type punning”). This might be a trap representation.

So, having

typedef union my_union_u
{
    uint32_t value;
    /// A byte array large enough to hold the largest of any value in the union.
    uint8_t bytes[sizeof(uint32_t)];
} my_union_t;

as a means of translating value into bytes is just fine in C. In C++ it works as a GNU gcc extension (but not as part of the C++ standard). See @Christoph's explanation in his answer here:

GNU extensions to standard C++ (and to C90) do explicitly allow type-punning with unions. Other compilers that don't support GNU extensions may also support union type-punning, but it's not part of the base language standard.

Download the code: you can download and run all the code below from my eRCaGuy_hello_world repo here: "type_punning.c". gcc build and run commands for both C and C++ are found in the comments at the very top of the file.

So, you can do something like this to read the individual bytes out of the uint32_t value:

TECHNIQUE 1: union-based type punning (this is "type punning"):

This is what "type punning" means: writing one type into a union and then reading out another type, thereby using the union to perform type "conversion".

my_union_t u;

// write to uint32_t value
u.value = 1234;

// read individual bytes from uint32_t value
printf("1st byte = 0x%02X\n", (u.bytes)[0]);
printf("2nd byte = 0x%02X\n", (u.bytes)[1]);
printf("3rd byte = 0x%02X\n", (u.bytes)[2]);
printf("4th byte = 0x%02X\n", (u.bytes)[3]);

Sample output:

On a little-endian architecture:

1st byte = 0xD2
2nd byte = 0x04
3rd byte = 0x00
4th byte = 0x00

On a big-endian architecture:

1st byte = 0x00
2nd byte = 0x00
3rd byte = 0x04
4th byte = 0xD2

You can use raw pointers to obtain bytes from variables too, but this technique also has hardware architecture endianness issues.

This can also be done without a union, by reading the bytes through a raw pointer, like this:

TECHNIQUE 2: reading through raw pointers (this is not "type punning"):

uint32_t value = 1234;
uint8_t *bytes = (uint8_t *)&value;

// read individual bytes from uint32_t value
printf("1st byte = 0x%02X\n", bytes[0]);
printf("2nd byte = 0x%02X\n", bytes[1]);
printf("3rd byte = 0x%02X\n", bytes[2]);
printf("4th byte = 0x%02X\n", bytes[3]);

Sample output:

On a little-endian architecture:

1st byte = 0xD2
2nd byte = 0x04
3rd byte = 0x00
4th byte = 0x00

On a big-endian architecture:

1st byte = 0x00
2nd byte = 0x00
3rd byte = 0x04
4th byte = 0xD2

You can use bitmasks and bit-shifting to avoid hardware architecture endianness portability issues.

To avoid endianness issues which exist with both the union type punning and raw pointer approaches above, you can use something like the following instead. This avoids endianness differences between hardware architectures:

TECHNIQUE 3.1: use bit-masks and bit shifting (this is not "type punning"):

uint32_t value = 1234;

uint8_t byte0 = (value >> 0)  & 0xff;
uint8_t byte1 = (value >> 8)  & 0xff;
uint8_t byte2 = (value >> 16) & 0xff;
uint8_t byte3 = (value >> 24) & 0xff;

printf("1st byte = 0x%02X\n", byte0);
printf("2nd byte = 0x%02X\n", byte1);
printf("3rd byte = 0x%02X\n", byte2);
printf("4th byte = 0x%02X\n", byte3);

Sample output (the above technique is endianness-independent!):

On all architectures, both big-endian AND little-endian:

1st byte = 0xD2
2nd byte = 0x04
3rd byte = 0x00
4th byte = 0x00

OR:

TECHNIQUE 3.2: use a convenience macro to do bit-masks and bit shifting:

#define BYTE(value, byte_num) ((uint8_t)(((value) >> (8*(byte_num))) & 0xff))

uint32_t value = 1234;

uint8_t byte0 = BYTE(value, 0);
uint8_t byte1 = BYTE(value, 1);
uint8_t byte2 = BYTE(value, 2);
uint8_t byte3 = BYTE(value, 3);

// OR

uint8_t bytes[] = {
    BYTE(value, 0), 
    BYTE(value, 1), 
    BYTE(value, 2), 
    BYTE(value, 3), 
};

printf("1st byte = 0x%02X\n", byte0);
printf("2nd byte = 0x%02X\n", byte1);
printf("3rd byte = 0x%02X\n", byte2);
printf("4th byte = 0x%02X\n", byte3);
printf("---------------\n");
printf("1st byte = 0x%02X\n", bytes[0]);
printf("2nd byte = 0x%02X\n", bytes[1]);
printf("3rd byte = 0x%02X\n", bytes[2]);
printf("4th byte = 0x%02X\n", bytes[3]);

Sample output (the above technique is endianness-independent!):

On all architectures, both big-endian AND little-endian:

1st byte = 0xD2
2nd byte = 0x04
3rd byte = 0x00
4th byte = 0x00
---------------
1st byte = 0xD2
2nd byte = 0x04
3rd byte = 0x00
4th byte = 0x00

Otherwise, (my_pixel.RGBA)[0], or (u.bytes)[0], will be equal to byte0 (as I've defined it above) if the architecture is little-endian, or equal to byte3 if the architecture is big-endian.

See the endianness graphic on Wikipedia: https://en.wikipedia.org/wiki/Endianness. Notice that in big-endian, the most-significant byte of any given variable is stored first (meaning: in lower addresses) in memory, but in little-endian it is the least-significant byte that is stored first (in lower addresses) in memory. Also remember that endianness describes byte order, NOT bit order (bit order within a byte has nothing to do with endianness), and that each byte is 2 hex characters, or "nibbles", where a nibble is 4 bits.

According to the Wikipedia article above, networking protocols usually use big-endian byte order, whereas most processors (x86, most ARM, etc.) are usually little-endian (emphasis added):

Big-endianness is the dominant ordering in networking protocols, such as in the internet protocol suite, where it is referred to as network order, transmitting the most significant byte first. Conversely, little-endianness is the dominant ordering for processor architectures (x86, most ARM implementations, base RISC-V implementations) and their associated memory.

More notes regarding whether or not "type punning" is supported by the standard

According to Wikipedia's "Type punning" article, writing to union member value but reading from RGBA[4] is "unspecified behavior". However, @Eric Postpischil points out in his comment below this answer that Wikipedia is wrong. The other references at the top of this answer also don't align with the Wikipedia answer as it is written now.

Eric Postpischil's comment, which I now understand and agree with, states (emphasis added):

The quoted text, about bytes corresponding to union members other than the last one stored, does not apply to this situation. It applies to a case where, for example, a two-byte short member is written and a four-byte int member is read. The extra two bytes are unspecified. This gives a C implementation license to implement the store to the short as a two-byte store (leaving the remaining bytes of the union unchanged) or a four-byte store (perhaps because it is efficient for the processor). In the case at hand, we have a four-byte uint32_t member and a four-byte uint8_t [4] member.

Wikipedia claims (as of 22 Apr. 2021):

For union:

union {
    unsigned int ui;
    float d;
} my_union = { .d = x };

Accessing my_union.ui after initializing the other member, my_union.d, is still a form of type-punning [4] in C and the result is unspecified behavior [5] (and undefined behavior in C++ [6]).

From reference [5] above: "Unspecified Behavior" includes:

The values of bytes that correspond to union members other than the one last stored into (6.2.6.1).

Read that way, storing data into one member of a union and reading it from another, which is exactly what you want to use that union for, would be "unspecified behavior" per the C standard. As Eric Postpischil's comment above explains, though, that text (from the non-normative Annex J) does not apply when the two members overlap exactly, as a uint32_t and a uint8_t[4] do.

gcc documents type punning through a union (writing into one member but reading from another, as a form of "translation") as allowed in both C and C++. In C it is also permitted by the standard, per the footnote quoted above; in C++ it is a gcc extension and not part of the C++ standard.
