I want to represent a 32-bit number using RGBA values. Is it portable to generate the values for that number using a union? Consider this C code:

union pixel {
    uint32_t value;
    uint8_t RGBA[4];
};

This compiles fine, and I'd like to use it instead of a bunch of functions. But is it safe?

timrau
  • 1
    How do you plan to use the union, and what results do you hope to achieve? – Nate Eldredge Apr 22 '21 at 03:54
  • 1
    If you assign a value such as 0x01234567 to `value`, the number in `RGBA[0]` will be different depending on whether the platform is big-endian (0x01) or little-endian (0x67). Therefore, it is not portable across platforms with different endianness. – Jonathan Leffler Apr 22 '21 at 03:54
  • @JonathanLeffler is endianness the only issue? If so, it can always be accounted for. – QuestionLimitGoBrrrrr Apr 22 '21 at 04:00
  • @QuestionLimitGoBrrrrr, I think it's also unspecified behavior--allowed only by a gcc extension, but I can't find the reference in the gcc manual at the moment. I added some more links to the bottom of my answer though for you to go dig down. – Gabriel Staples Apr 22 '21 at 04:12
  • @GabrielStaples: In C, reading a union other than the last one written reinterprets the bytes in the new type; it is not unspecified or undefined. In C++, the behavior would be undefined. – Eric Postpischil Apr 22 '21 at 07:27
  • @EricPostpischil, `In C, reading a union other than the last one written reinterprets the bytes in the new type`. Indeed, it does do this. I've done this in both C and C++ with the gcc compiler and saw no difference. But, that's due to a gcc extension, apparently, making it so. But, the C standard states it is "unspecified behavior." See the Annex J "Portability Issues" screenshot and quote for the C standard in my answer. – Gabriel Staples Apr 22 '21 at 07:31
  • @GabrielStaples: Annex J is non-normative. In my other comment, I cited normative text. Additionally, the text you highlight is inapplicable. The bytes of a four-byte integer and a four-byte array overlap exactly; none are left unspecified in this case. – Eric Postpischil Apr 22 '21 at 07:35
  • @EricPostpischil, can you help me find the standard? Do I need to go buy it? I don't have a copy of a final standard. Or, can you find the words in [the one I link to](https://web.archive.org/web/20181230041359/http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf)? – Gabriel Staples Apr 22 '21 at 07:37
  • [This question](https://stackoverflow.com/questions/81656/where-do-i-find-the-current-c-or-c-standard-documents) has information about locating the C standard. – Eric Postpischil Apr 22 '21 at 07:40

1 Answer

6

Using unions for "type punning" is fine in C, and fine in gcc's C++ as well (as a gcc [g++] extension). But "type punning" via unions has hardware architecture endianness considerations.

This is called "type punning", and it is not directly portable due to endianness considerations. However, otherwise, doing it is just fine. The C standards have NOT been great about making it clear this is just fine, but apparently it is. Read these answers and sources:

  1. Is type-punning through a union unspecified in C99, and has it become specified in C11?
  2. Unions and type-punning
  3. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Type%2Dpunning - type punning is allowed in gcc C and C++

Additionally, the C18 draft, N2176 ISO/IEC 9899:2017 states in section "6.5.2.3 Structure and union members", the following in footnote 97:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called “type punning”). This might be a trap representation.

So, having

typedef union my_union_u
{
    uint32_t value;
    /// A byte array large enough to hold the largest of any value in the union.
    uint8_t bytes[sizeof(uint32_t)];
} my_union_t;

as a means of translating value into bytes is just fine in C. In C++ it works as a GNU gcc extension (but not as part of the C++ standard). See @Christoph's explanation in his answer here:

GNU extensions to standard C++ (and to C90) do explicitly allow type-punning with unions. Other compilers that don't support GNU extensions may also support union type-punning, but it's not part of the base language standard.


Download the code: you can download and run all the code below from my eRCaGuy_hello_world repo here: "type_punning.c". gcc build and run commands for both C and C++ are found in the comments at the very top of the file.


So, you can do something like this to read the individual bytes out of the uint32_t value:

TECHNIQUE 1: union-based type punning (this is "type punning"):

This is what "type punning" means: writing one type into a union and then reading out another type, thereby using the union to perform type "conversion".

my_union_t u;

// write to uint32_t value
u.value = 1234;

// read individual bytes from uint32_t value
printf("1st byte = 0x%02X\n", (u.bytes)[0]);
printf("2nd byte = 0x%02X\n", (u.bytes)[1]);
printf("3rd byte = 0x%02X\n", (u.bytes)[2]);
printf("4th byte = 0x%02X\n", (u.bytes)[3]);

Sample output:

  1. On a little-endian architecture:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    
  2. On a big-endian architecture:
    1st byte = 0x00
    2nd byte = 0x00
    3rd byte = 0x04
    4th byte = 0xD2
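
Since the first byte of the union depends on the byte order, the same union can be used to check at run time which order the current platform uses. The following is only a minimal sketch under that assumption; the names `endian_probe_t`, `is_little_endian()`, and `is_big_endian()` are hypothetical and are not part of the code in my repo:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as my_union_t above; the name endian_probe_t is hypothetical. */
typedef union endian_probe_u
{
    uint32_t value;
    uint8_t bytes[sizeof(uint32_t)];
} endian_probe_t;

/* True when the least-significant byte sits at the lowest address. */
bool is_little_endian(void)
{
    endian_probe_t probe;
    probe.value = 0x01020304u;
    return probe.bytes[0] == 0x04;
}

/* True when the most-significant byte sits at the lowest address. */
bool is_big_endian(void)
{
    endian_probe_t probe;
    probe.value = 0x01020304u;
    return probe.bytes[0] == 0x01;
}
```

On the common little-endian and big-endian architectures exactly one of the two functions should return true. This relies on union type punning being valid, so it is a C sketch (or gcc C++), not portable standard C++.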
    

You can use raw pointers to obtain bytes from variables too, but this technique also has hardware architecture endianness issues.

This could be done withOUT a union if you wanted by using raw pointers too, like this:

TECHNIQUE 2: reading through raw pointers (this is not "type punning"):

uint32_t value = 1234;
uint8_t *bytes = (uint8_t *)&value;

// read individual bytes from uint32_t value
printf("1st byte = 0x%02X\n", bytes[0]);
printf("2nd byte = 0x%02X\n", bytes[1]);
printf("3rd byte = 0x%02X\n", bytes[2]);
printf("4th byte = 0x%02X\n", bytes[3]);

Sample output:

  1. On a little-endian architecture:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    
  2. On a big-endian architecture:
    1st byte = 0x00
    2nd byte = 0x00
    3rd byte = 0x04
    4th byte = 0xD2
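
A related approach that is not one of the techniques in this answer is to copy the bytes out with `memcpy()`. It is often suggested because it is well defined in both C and C++, but it exposes the same platform byte order as the union and raw pointer techniques. A minimal sketch, with hypothetical helper names:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the object representation of value into out,
 * in whatever byte order the platform uses. */
void u32_to_native_bytes(uint32_t value, uint8_t out[4])
{
    memcpy(out, &value, sizeof value);
}

/* Hypothetical helper: the inverse copy, from 4 native-order bytes back
 * into a uint32_t. */
uint32_t native_bytes_to_u32(const uint8_t in[4])
{
    uint32_t value;
    memcpy(&value, in, sizeof value);
    return value;
}
```

A round trip through both helpers returns the original value on any platform, but the contents of `out` still differ between little-endian and big-endian machines.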
    

You can use bitmasks and bit-shifting to avoid hardware architecture endianness portability issues.

To avoid endianness issues which exist with both the union type punning and raw pointer approaches above, you can use something like the following instead. This avoids endianness differences between hardware architectures:

TECHNIQUE 3.1: use bit-masks and bit shifting (this is not "type punning"):

uint32_t value = 1234;

uint8_t byte0 = (value >> 0)  & 0xff;
uint8_t byte1 = (value >> 8)  & 0xff;
uint8_t byte2 = (value >> 16) & 0xff;
uint8_t byte3 = (value >> 24) & 0xff;

printf("1st byte = 0x%02X\n", byte0);
printf("2nd byte = 0x%02X\n", byte1);
printf("3rd byte = 0x%02X\n", byte2);
printf("4th byte = 0x%02X\n", byte3);

Sample output (the above technique is endianness-independent!):

  1. On all architectures: both big-endian AND little-endian:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    

OR:

TECHNIQUE 3.2: use a convenience macro to do bit-masks and bit shifting:

#define BYTE(value, byte_num) ((uint8_t)(((value) >> (8*(byte_num))) & 0xff))

uint32_t value = 1234;

uint8_t byte0 = BYTE(value, 0);
uint8_t byte1 = BYTE(value, 1);
uint8_t byte2 = BYTE(value, 2);
uint8_t byte3 = BYTE(value, 3);

// OR

uint8_t bytes[] = {
    BYTE(value, 0), 
    BYTE(value, 1), 
    BYTE(value, 2), 
    BYTE(value, 3), 
};

printf("1st byte = 0x%02X\n", byte0);
printf("2nd byte = 0x%02X\n", byte1);
printf("3rd byte = 0x%02X\n", byte2);
printf("4th byte = 0x%02X\n", byte3);
printf("---------------\n");
printf("1st byte = 0x%02X\n", bytes[0]);
printf("2nd byte = 0x%02X\n", bytes[1]);
printf("3rd byte = 0x%02X\n", bytes[2]);
printf("4th byte = 0x%02X\n", bytes[3]);

Sample output (the above technique is endianness-independent!):

  1. On all architectures: both big-endian AND little-endian:
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    ---------------
    1st byte = 0xD2
    2nd byte = 0x04
    3rd byte = 0x00
    4th byte = 0x00
    

Otherwise, (my_pixel.RGBA)[0], or (u.bytes)[0], might be equal to byte0 (as I've defined it above) if the architecture is Little-endian, or equal to byte3 if the architecture is Big-endian.
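
For the RGBA case in the question, the same shifting idea can be applied in both directions. This is only a sketch: the type `rgba_t`, the function names, and the choice to put R in the most-significant byte are all assumptions made for illustration, not something the question specifies.

```c
#include <stdint.h>

/* Hypothetical pixel type; the channel order is an assumption. */
typedef struct rgba_s
{
    uint8_t r;
    uint8_t g;
    uint8_t b;
    uint8_t a;
} rgba_t;

/* Pack the channels into one value as 0xRRGGBBAA, independent of endianness.
 * Each channel is cast to uint32_t before shifting so the shift is done in
 * an unsigned 32-bit type. */
uint32_t rgba_pack(rgba_t c)
{
    return ((uint32_t)c.r << 24) |
           ((uint32_t)c.g << 16) |
           ((uint32_t)c.b << 8)  |
           ((uint32_t)c.a << 0);
}

/* Split a 0xRRGGBBAA value back into its channels. */
rgba_t rgba_unpack(uint32_t value)
{
    rgba_t c;
    c.r = (uint8_t)((value >> 24) & 0xff);
    c.g = (uint8_t)((value >> 16) & 0xff);
    c.b = (uint8_t)((value >> 8)  & 0xff);
    c.a = (uint8_t)((value >> 0)  & 0xff);
    return c;
}
```

With this layout, packing R=0x12, G=0x34, B=0x56, A=0x78 gives 0x12345678 on every architecture, because the shifts operate on the value rather than on its bytes in memory.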

See the endianness graphic in the Wikipedia article: https://en.wikipedia.org/wiki/Endianness. Notice that in big-endian, the most-significant byte of any given variable is stored first (meaning: in lower addresses) in memory, but in little-endian it is the least-significant byte that is stored first (in lower addresses) in memory. Also remember that endianness describes byte order, NOT bit order (bit order within a byte has nothing to do with endianness), and that each byte is 2 hex characters, or "nibbles", where a nibble is 4 bits.

According to the Wikipedia article above, networking protocols usually use big-endian byte order, whereas most processors (x86, most ARM, etc.), usually are little-endian (emphasis added):

Big-endianness is the dominant ordering in networking protocols, such as in the internet protocol suite, where it is referred to as network order, transmitting the most significant byte first. Conversely, little-endianness is the dominant ordering for processor architectures (x86, most ARM implementations, base RISC-V implementations) and their associated memory.
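
If the value has to leave the machine (a file, a socket, etc.), one way to get a fixed byte order is to write the bytes most-significant first using shifts, which matches network order regardless of the host. A minimal sketch; `store_be32()` and `load_be32()` are hypothetical names, not standard library functions:

```c
#include <stdint.h>

/* Write value into out[0..3], most-significant byte first (big-endian /
 * network order), regardless of the host's own byte order. */
void store_be32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)((value >> 24) & 0xff);
    out[1] = (uint8_t)((value >> 16) & 0xff);
    out[2] = (uint8_t)((value >> 8)  & 0xff);
    out[3] = (uint8_t)((value >> 0)  & 0xff);
}

/* Read 4 big-endian bytes back into a host-order uint32_t. */
uint32_t load_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
           ((uint32_t)in[3] << 0);
}
```

Storing 1234 (0x000004D2) this way yields the bytes 0x00, 0x00, 0x04, 0xD2 on both little-endian and big-endian hosts.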


More notes regarding whether or not "type punning" is supported by the standard

According to Wikipedia's "Type punning" article, writing to union member value but reading from member RGBA is "unspecified behavior". However, @Eric Postpischil points out in his comment below this answer that Wikipedia is wrong. The other references at the top of this answer also don't align with the Wikipedia article as it is written now.

Eric Postpischil's comment, which I now understand and agree with, states (emphasis added):

The quoted text, about bytes corresponding to union members other than the last one stored, does not apply to this situation. It applies to a case where, for example, a two-byte short member is written and a four-byte int member is read. The extra two bytes are unspecified. This gives a C implementation license to implement the store to the short as a two-byte store (leaving the remaining bytes of the union unchanged) or a four-byte store (perhaps because it is efficient for the processor). In the case at hand, we have a four-byte uint32_t member and a four-byte uint8_t [4] member.
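
The "members overlap exactly" condition from that comment can be checked at compile time. Below is a sketch that assumes a C11 compiler (for `_Static_assert`) and uses the union from the question; the assertion messages are my own wording:

```c
#include <stdint.h>

/* The union from the question. */
union pixel {
    uint32_t value;
    uint8_t RGBA[4];
};

/* Both members must cover the same number of bytes, so that writing one
 * member leaves no byte of the other member unspecified. */
_Static_assert(sizeof(uint8_t[4]) == sizeof(uint32_t),
               "RGBA must cover every byte of value");

/* The union itself should not be larger than its members. */
_Static_assert(sizeof(union pixel) == sizeof(uint32_t),
               "union pixel must not contain padding");
```

If either condition does not hold on some platform, the build fails instead of the program silently reading unspecified bytes.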

Wikipedia claims (as of 22 Apr. 2021):

For union:

union {
    unsigned int ui;
    float d;
} my_union = { .d = x };

Accessing my_union.ui after initializing the other member, my_union.d, is still a form of type-punning [4] in C and the result is unspecified behavior [5] (and undefined behavior in C++ [6]).

From reference [5] above: "Unspecified Behavior" includes:

The values of bytes that correspond to union members other than the one last stored into (6.2.6.1).

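Read on its own, this would suggest that storing data into one member of a union and reading it from another, which is exactly what you want to use that union for, is "unspecified behavior" per the C standard. As Eric Postpischil's comment quoted above explains, though, that text does not apply here, because the uint32_t member and the uint8_t [4] member overlap exactly.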

gcc allows type punning (writing into one member of a union, but reading from another member in the union, as a form of "translation"). In C, the standard permits it too, per footnote 97 quoted at the top of this answer; in C++ it works only as a gcc extension and is not part of the C++ standard.

See also:

  1. Download and run all of the above code from my repo here: https://github.com/ElectricRCAircraftGuy/eRCaGuy_hello_world/blob/master/c/type_punning.c
  2. Unions, aliasing and type-punning in practice: what works and what does not?
  3. Unions and type-punning
  4. [my repo] I added READ_BYTE() as a macro to my utilities.h file in my eRCaGuy_hello_world repo.
  5. Where do I find the current C or C++ standard documents?
  6. https://news.ycombinator.com/item?id=17263328
    1. Is type-punning through a union unspecified in C99, and has it become specified in C11? <== SEE HERE ESPECIALLY. APPARENTLY THE C STANDARD HASN'T BEEN GOOD ABOUT BEING SUPER CLEAR ABOUT THIS.
  7. More of my answers:
    1. Answer 1/3: use a union and a packed struct.
    2. Answer 2/3: convert a struct to an array of bytes via manual bit-shifting.
    3. Answer 3/3: use a packed struct and a raw uint8_t pointer to it.
halfer
Gabriel Staples
  • Wikipedia is not a great reference for C language behaviour; it would be better to quote directly the sources that Wikipedia references. I disagree that it is unspecified behaviour in C, it is implementation-defined (based on the system endianness and sizes of types) – M.M Apr 22 '21 at 04:11
  • @M.M, yeah, this would require further research on my end. For now though I'll let it rest. – Gabriel Staples Apr 22 '21 at 04:14
  • @M.M, note: if you see something wrong on Wikipedia, and you've done the detailed research to be absolutely sure, please just fix it. That's what it's there for: a wiki, for anyone to edit. I do the same. I edit and add to https://cppreference.com (also a wiki) and [Wikipedia.org](https://www.wikipedia.org/) regularly. – Gabriel Staples Apr 22 '21 at 04:26
  • 2
    I spend enough time on internet arguments, don't really want to deal with competing edits from someone with a different view :) The solution would be to link to a canonical SO question anyway I would think – M.M Apr 22 '21 at 04:28
  • 1
    @GabrielStaples: In C, reading a union other than the last one written reinterprets the bytes in the new type, per C 2018 6.5.2.3 3 and note 99. It is not unspecified or undefined. – Eric Postpischil Apr 22 '21 at 07:30
  • @EricPostpischil, since I'm using the [latest draft copy of that standard](https://web.archive.org/web/20181230041359/http://www.open-std.org/jtc1/sc22/wg14/www/abq/c17_updated_proposed_fdis.pdf), as found under the "C17/C18" section of [this answer](https://stackoverflow.com/a/83763/4561887), because I don't want to pay for a paid copy for the sake of argument alone, can you please provide the exact quote instead of the section numbers, so I can find it in the draft version too? A highlighted screenshot would also suffice. I'm trying to get my answer "upvotable" and correct here. – Gabriel Staples Apr 22 '21 at 07:51
  • Please note that this non-wayback-machine link to [N2176](https://files.lhmouse.com/standards/ISO%20C%20N2176.pdf) seems valid. Also, in a previous draft ([N1570](http://port70.net/~nsz/c/c11/n1570.html)), the note mentioned by E.Postpischil is [97](http://port70.net/~nsz/c/c11/n1570.html#6.5.2.3p5). – Bob__ Apr 22 '21 at 08:24
  • @GabrielStaples: 6.5.2.3 3 in the official version is 6.5.2.3 in the draft you link to, and note 99 is note 97. – Eric Postpischil Apr 22 '21 at 12:11
  • 2
    The quoted text, about bytes corresponding to union members other than the last one stored, does not apply to this situation. It applies to a case where, for example, a two-byte `short` member is written and a four-byte `int` member is read. The extra two bytes are unspecified. This gives a C implementation license to implement the store to the `short` as a two-byte store (leaving the remaining bytes of the union unchanged) or a four-byte store (perhaps because it is efficient for the processor). In the case at hand, we have a four-byte `uint32_t` member and a four-byte `uint8_t [4]` member. – Eric Postpischil Apr 22 '21 at 12:15
  • @EricPostpischil, after further research, I agree with you. I've updated my answer as a result. Thank you for your latest comment with detailed explanation. – Gabriel Staples May 02 '21 at 20:54