328

I have used unions earlier comfortably; today I was alarmed when I read this post and came to know that this code

union ARGB
{
    uint32_t colour;

    struct componentsTag
    {
        uint8_t b;
        uint8_t g;
        uint8_t r;
        uint8_t a;
    } components;

} pixel;

pixel.colour = 0xff040201;  // ARGB::colour is the active member from now on

// somewhere down the line, without any edit to pixel

if(pixel.components.a)      // accessing the non-active member ARGB::components

is actually undefined behaviour, i.e. reading from a member of the union other than the one most recently written to leads to undefined behaviour. If this isn't the intended usage of unions, what is? Can someone please explain it elaborately?

Update:

I wanted to clarify a few things in hindsight.

  • The answer to the question isn't the same for C and C++; my ignorant younger self tagged it as both C and C++.
  • After scouring through C++11's standard I couldn't conclusively say that it calls out accessing/inspecting a non-active union member is undefined/unspecified/implementation-defined. All I could find was §9.5/1:

    If a standard-layout union contains several standard-layout structs that share a common initial sequence, and if an object of this standard-layout union type contains one of the standard-layout structs, it is permitted to inspect the common initial sequence of any of standard-layout struct members. §9.2/19: Two standard-layout structs share a common initial sequence if corresponding members have layout-compatible types and either neither member is a bit-field or both are bit-fields with the same width for a sequence of one or more initial members.

  • While in C, (C99 TC3 - DR 283 onwards) it's legal to do so (thanks to Pascal Cuoq for bringing this up). However, attempting to do it can still lead to undefined behavior, if the value read happens to be invalid (so called "trap representation") for the type it is read through. Otherwise, the value read is implementation defined.
  • C89/90 called this out under unspecified behavior (Annex J) and K&R's book says it's implementation defined. Quote from K&R:

    This is the purpose of a union - a single variable that can legitimately hold any one of several types. [...] so long as the usage is consistent: the type retrieved must be the type most recently stored. It is the programmer's responsibility to keep track of which type is currently stored in a union; the results are implementation-dependent if something is stored as one type and extracted as another.

  • Extract from Stroustrup's TC++PL (emphasis mine)

    Use of unions can be essential for compactness of data [...] sometimes misused for "type conversion".

Above all, this question (whose title remains unchanged since my ask) was posed with the intention of understanding the purpose of unions, and not what the standard allows. E.g. using inheritance for code reuse is, of course, allowed by the C++ standard, but it wasn't the purpose or the original intention of introducing inheritance as a C++ language feature. This is the reason Andrey's answer continues to remain the accepted one.

Community
legends2k
  • 12
    Simply stated, compilers are allowed to insert padding between elements in a structure. Thus, `b, g, r,` and `a` may not be contiguous, and thus not match the layout of a `uint32_t`. This is in addition to the endianness issues that others have pointed out. – Thomas Matthews Feb 22 '10 at 18:48
  • Thanks AndreyT for giving a practical example on the usage of unions (which is to save space), I understood it fully for that explanation and ammoQ's code. – legends2k Feb 22 '10 at 23:23
  • 11
    This is exactly why you shouldn't tag questions as both C and C++. The answers are different, but since answerers do not even say which tag they are answering for (do they even know?), you get rubbish. – Pascal Cuoq Aug 17 '13 at 06:45
  • 7
    @downvoter Thanks for not explaining, I understand that you want me to magically understand your gripe and not repeat it in future :P – legends2k Aug 06 '14 at 14:58
  • 1
    Regarding the original intention of having _union_, bear in mind that the C standard post-dates C unions by several years. A quick look at Unix V7 shows a few type conversions via unions. – ninjalj Apr 20 '15 at 00:17
  • Related: [What is trap representation?](http://stackoverflow.com/q/6725809/183120) – legends2k Sep 09 '15 at 10:36
  • 4
    `scouring C++11's standard I couldn't conclusively say that it calls out accessing/inspecting a non-active union member is undefined [...] All I could find was §9.5/1` ...really? you quote an exception _note_, not _the main point right at the start of the paragraph_: **"In a union, at most one of the non-static data members can be active at any time, that is, the value of at most one of the non-static data members can be stored in a union at any time."** - and down to p4: "In general, **one must use explicit destructor calls and placement new operators to change the active member of a union**" – underscore_d Jul 08 '16 at 20:36
  • 1
    One of the reasons it is not a defined behaviour in your very case is because endianness of the underlying storage of primitive types is undefined, so converting from `uint32_t` to a series of `uint8_t` could reorder which byte ends up in which `uint8_t` in absolutely any possible order. – Ludovic Zenohate Lagouardette Dec 02 '18 at 14:00
  • @underscore_d Sure it says all that, but only that much; a compiler author can take it as a. _unspecified_, b. _implementation-defined_ or c. _undefined_. It's not explicitly called out. – legends2k Jul 22 '19 at 08:23

16 Answers

539

The purpose of unions is rather obvious, but for some reason people miss it quite often.

The purpose of union is to save memory by using the same memory region for storing different objects at different times. That's it.

It is like a room in a hotel. Different people live in it for non-overlapping periods of time. These people never meet, and generally don't know anything about each other. By properly managing the time-sharing of the rooms (i.e. by making sure different people don't get assigned to one room at the same time), a relatively small hotel can provide accommodations to a relatively large number of people, which is what hotels are for.

That's exactly what a union does. If you know that several objects in your program hold values with non-overlapping value-lifetimes, then you can "merge" these objects into a union and thus save memory. Just like a hotel room has at most one "active" tenant at each moment of time, a union has at most one "active" member at each moment of program time. Only the "active" member can be read. By writing into another member you switch the "active" status to that other member.
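
The time-sharing idea can be sketched in C as follows. This is only an illustration: `request`, `result`, `job_slot` and `process_job` are hypothetical names, and the sketch assumes the request text is NUL-terminated. The request is read completely before the result is written, so only the active member is ever read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical job slot: the request is no longer needed once the
 * result has been computed, so both can share the same storage. */
struct request { char text[32]; };                       /* alive before processing */
struct result  { uint32_t length; uint32_t checksum; };  /* alive after processing  */

union job_slot {
    struct request req;   /* active member until process_job() runs */
    struct result  res;   /* active member afterwards */
};

/* Consume the request and overwrite the slot with the result. */
void process_job(union job_slot *slot)
{
    assert(slot != NULL);

    /* Read everything needed from the currently active member first ... */
    uint32_t len = (uint32_t)strlen(slot->req.text);
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += (unsigned char)slot->req.text[i];

    /* ... then switch the active member by writing the other one. */
    slot->res.length = len;
    slot->res.checksum = sum;
}
```

The slot is as large as its largest member rather than the sum of both, which is the memory saving the hotel analogy describes.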

For some reason, this original purpose of the union got "overridden" with something completely different: writing one member of a union and then inspecting it through another member. This kind of memory reinterpretation (aka "type punning") is not a valid use of unions. It is described as producing implementation-defined behavior in C89/90.

EDIT: Using unions for the purposes of type punning (i.e. writing one member and then reading another) was given a more detailed definition in one of the Technical Corrigenda to the C99 standard (see DR#257 and DR#283). However, keep in mind that formally this does not protect you from running into undefined behavior by attempting to read a trap representation.
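
For C only, the type punning described in the EDIT can be sketched like this. It is a minimal illustration, not a recommendation: the function names are hypothetical, and the sketch assumes `float` is 32 bits wide. `uint32_t` has no padding bits, so the trap-representation caveat does not arise when reading through it; a `memcpy` variant is shown next to it because a byte copy is well defined in both C and C++.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t),
               "this sketch assumes a 32-bit float");

/* C (C99 TC3 and later): reading a member other than the one last
 * written reinterprets the stored bytes as the type being read. */
uint32_t float_bits_union(float f)
{
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}

/* The same reinterpretation expressed as a byte copy. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Both functions return the same bit pattern for the same input; which pattern that is depends on the platform's floating-point format.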

Juan Carlos Ramirez
AnT stands with Russia
  • 51
    +1 for being elaborate, giving a simple practical example and saying about the legacy of unions! – legends2k Feb 22 '10 at 22:52
  • 1
    In keeping with the example, you could liken accessing the union through a member other than the one last written to calling the hotel room looking for a, but finding b. Someone picked up the phone, but it's not safe. – hiddensunset4 Jan 29 '11 at 09:53
  • 1
    @Nick: It has never been legal to use unions for type punning until very recently, when the practice was standardized by the C committee. – AnT stands with Russia Jan 18 '12 at 20:57
  • 2
    @AndreyT: So is it now legal (as in standardized in C and C++) to do this type of punning and still be portable? If so which versions of the language standards sanctify it? – legends2k May 07 '13 at 14:36
  • 6
    The problem I have with this answer is that most OSes I have seen have header files that do this exact thing. For example I've seen it in old (pre-64-bit) versions of `` on both Windows and Unix. Dismissing it as "not valid" and "undefined" isn't really sufficient if I'm going to be called upon to understand code that works in this exact way. – T.E.D. Jul 15 '13 at 18:38
  • 1
    I strongly disagree. I most commonly see unions used _for aliasing_, specifically for [reasons outlined below](http://stackoverflow.com/a/18177444/111307). – bobobobo Aug 11 '13 at 22:43
  • 2
    @bobobobo: What you see is not necessarily relevant and, in any case, not grounds for "disagreeing". Until relatively recent corrections to the language spec, the language explicitly prohibited using unions for "aliasing" (i.e. for type punning). And while modern C allows such usage, it is still not the original purpose of unions. Unions were introduced for memory sharing, as described in my answer, not for type punning. – AnT stands with Russia Aug 12 '13 at 04:08
  • "It generally leads to undefined behavior" is a pretty inaccurate statement. If unions could be used to store different types of data at different times as you stated using the same piece of memory (reliably, without undefined behavior occurring), then it's pretty concrete what would happen in the example I showed below, or even with type-punning. Although aliasing might have been an unforeseen use of unions, it definitely is _not_ undefined behavior. The results in my example are predictable and concrete. – bobobobo Aug 13 '13 at 01:41
  • Well, [reading your answer](http://stackoverflow.com/a/1812932/111307) about the other question kind of explains what you meant. Hmm. – bobobobo Aug 13 '13 at 01:45
  • So _theoretically_ `elts[0]` could map to `vec.z` in my example. But practically speaking, compiler implementers (and the new standard) have made the behavior as you would expect. – bobobobo Aug 13 '13 at 01:52
  • 35
    @AndreyT “It has never been legal to use unions for type punning until very recently”: 2004 is not “very recent”, especially considering that it is only C99 that was initially clumsily worded, appearing to make type-punning through unions undefined. In reality, type-punning through unions is legal in C89, legal in C11, and it was legal in C99 all along although it took until 2004 for the committee to fix incorrect wording, and the subsequent release of TC3. http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm – Pascal Cuoq Aug 17 '13 at 06:53
  • 2
    Upvoted for the very first sentence in the post. Of course, you could elaborate on why people use unions for type punning; I'm sure you know it's not just for "some" reason. :) – alecov Aug 17 '13 at 14:18
  • 1
    @PascalCuoq: Stroustrup, in his TC++PL, states the purpose of unions along the lines of Andrey's: "Use of unions can be essential for compactness of data... sometimes misused for _type conversion_". – legends2k Oct 11 '13 at 05:46
  • 9
    @legends2k Programming languages are defined by standards. The Technical Corrigendum 3 of the C99 standard explicitly allows type-punning in its footnote 82, which I invite you to read for yourself. This is not TV where rock stars are interviewed and express their opinions on climate change. Stroustrup's opinion has zero influence on what the C standard says. – Pascal Cuoq Oct 11 '13 at 05:57
  • @PascalCuoq: Of course, I know that any individual's opinion doesn't matter and only the standard does. My original question was not for a language lawyer, but to understand the intended purpose/rationale behind having such a feature in the language. Please read the title of the post. – legends2k Oct 11 '13 at 06:16
  • @PascalCuoq Stroustrup is talking about C++ where, according to the C++ spec, it's not legal. – bames53 Jan 07 '15 at 20:33
  • 1
    @AnT When you say now it's officially allowed, you refer to C, not C++, correct? http://stackoverflow.com/a/11996970/969365 – Antonio Mar 23 '15 at 12:15
  • 1
    @Antonio: Yes, I refer exclusively to C. – AnT stands with Russia Mar 23 '15 at 19:05
  • Expanding on that analogy, in the most hotels I'll visit, there's going to be a pretty good chance to see a ghost of the latest visitor floating around... – rr- Aug 15 '15 at 12:35
  • 9
    @legends2k "_I know that any individual's opinion doesn't matter and only the standard does_" The opinion of compiler writers matters a lot more than the (extremely poor) language "specification". – curiousguy Oct 14 '15 at 00:08
  • 3
    @T.E.D. Just because it's undefined, that doesn't mean it's unpredictable; it just means that the language specification doesn't spell out what happens when you try it. This is a case where it could effectively be considered "unofficially defined", as most compiler suppliers specifically code their compilers to allow unions to be used for type punning (because it would break a ton of code if it didn't work). However, relying on this behaviour will end up causing trouble whenever you run into a compiler that doesn't, or if any of the compilers that do change how they handle unions. – Justin Time - Reinstate Monica Sep 21 '16 at 17:00
  • The early drafts on C89 speak of type punning through unions. As far as I know, it has always been allowed. And it is the main reason why you use unions. – Lundin Oct 24 '16 at 06:17
  • @Lundin: Nope. The main purpose of unions is explained in sufficient detail in my answer. And it is so basic, natural, intuitive and well known to anyone that debating it would be ridiculous. As for type punning, drafts or not, it was explicitly prohibited in C89/90. – AnT stands with Russia Oct 24 '16 at 06:26
  • 1
    @AnT Cite the relevant part that "explicitly prohibits" it? What I found in some early draft [here](http://flash-gordon.me.uk/ansi.c.txt) was this text (3.3.2.3): "With one exception, if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined./33/". – Lundin Oct 24 '16 at 06:37
  • 1
    And then note 33 says: "The ``byte orders'' for scalar types are invisible to isolated programs that do not indulge in type punning (for example, by assigning to one member of a union and inspecting the storage by accessing another member that is an appropriately sized array of character type), but must be accounted for when conforming to externally-imposed storage layouts." – Lundin Oct 24 '16 at 06:37
  • @Lundin: OK, I take it back. It was not "prohibited", but rather "implementation-defined". – AnT stands with Russia Oct 24 '16 at 06:50
  • Because it's the only method of type punning left that doesn't immediately invoke strongly undefined behavior when performed on locals. It actually _has a definition_ of what it does and only the byte order stuff makes it nonportable. – Joshua Jun 18 '19 at 19:49
  • Yes, a union is there to save memory; it comes from an archaic age in computer science when extension memory was an expensive piece of hardware. – Damien Mattei Jun 05 '21 at 23:26
49

You could use unions to create structs like the following, which contains a field that tells us which component of the union is actually used:

struct VAROBJECT
{
    enum o_t { Int, Double, String } objectType;

    union
    {
        int intValue;
        double dblValue;
        char *strValue;
    } value;
} object;
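
A minimal sketch of how such a tagged union might be consumed, assuming C. The declaration is repeated (without the `object` variable) so that the block compiles on its own; `varobject_format` is a hypothetical helper, not part of the original answer. The point is that the tag is checked first and only the member it names is read.

```c
#include <assert.h>
#include <stdio.h>

/* Declaration repeated from the answer so the sketch is self-contained. */
struct VAROBJECT
{
    enum o_t { Int, Double, String } objectType;

    union
    {
        int intValue;
        double dblValue;
        char *strValue;
    } value;
};

/* Hypothetical helper: format the stored value into buf, reading only
 * the union member that objectType says is active.
 * Returns what snprintf returns, or -1 for an unknown tag. */
int varobject_format(const struct VAROBJECT *obj, char *buf, size_t size)
{
    assert(obj != NULL && buf != NULL);

    switch (obj->objectType) {
    case Int:    return snprintf(buf, size, "%d", obj->value.intValue);
    case Double: return snprintf(buf, size, "%.2f", obj->value.dblValue);
    case String: return snprintf(buf, size, "%s", obj->value.strValue);
    }
    return -1; /* tag value outside the enum */
}
```

Whoever stores a value is responsible for setting `objectType` in the same step; the union itself does not record which member is active.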
Erich Kitzmueller
  • 2
    I totally agree; without entering the undefined-behaviour chaos, perhaps this is the best intended behaviour of unions I can think of. But won't it waste space when I am just using, say, `int` or `char*` for 10 items of object[]? In that case, can't I declare separate structs for each data type instead of VAROBJECT? Wouldn't that reduce clutter and use less space? – legends2k Feb 22 '10 at 12:24
  • 3
    legends: In some cases, you simply can't do that. You use something like VAROBJECT in C in the same cases when you use Object in Java. – Erich Kitzmueller Feb 22 '10 at 19:02
  • The data structure of [tagged unions](http://en.wikipedia.org/wiki/Tagged_union) seems to be the only legitimate use of unions, as you explain. – legends2k Jun 12 '14 at 07:25
  • Also give an example of how to use the values. – Ciro Santilli OurBigBook.com May 04 '16 at 15:36
  • @legends2k : Imagine a callback function which has a pointer to a `VAROBJECT` structure. If you didn't have unions, you'd need three different callbacks to achieve the same thing : one fired if the value stored is an `int`, another one if it's a `double`, and yet another one if it's a `char*`. When using unions, you save the hassle of three different function prototypes at the expense of the client having to figure out what's the type of the actual value by checking the associated `enum`. – Daniel Kamil Kozar Jan 30 '18 at 21:02
  • 1
    @CiroSantilli新疆改造中心六四事件法轮功 A part of an example from *C++ Primer*, might help. https://wandbox.org/permlink/cFSrXyG02vOSdBk2 – Rick Sep 29 '18 at 11:42
  • Looks like the correct way to implement Rust's `Option` – Kotauskas May 27 '19 at 15:45
34

The behavior is undefined from the language point of view. Consider that different platforms can have different constraints in memory alignment and endianness. The code in a big endian versus a little endian machine will update the values in the struct differently. Fixing the behavior in the language would require all implementations to use the same endianness (and memory alignment constraints...) limiting use.

If you are using C++ (you are using two tags) and you really care about portability, then you can just use the struct and provide a setter that takes the uint32_t and sets the fields appropriately through bitmask operations. The same can be done in C with a function.
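
A sketch of that portable approach in C. The function names are hypothetical, and the bit positions (alpha in the top byte, blue in the bottom byte) are an assumption chosen to match the question's `0xff040201` value. Because the code operates on the value of the `uint32_t` and not on its bytes in memory, the result does not depend on endianness or struct padding.

```c
#include <stdint.h>

/* Assumed layout of the 32-bit value: 0xAARRGGBB. */
uint32_t argb_pack(uint8_t a, uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Each getter shifts the wanted byte down; the cast keeps the low 8 bits. */
uint8_t argb_alpha(uint32_t colour) { return (uint8_t)(colour >> 24); }
uint8_t argb_red(uint32_t colour)   { return (uint8_t)(colour >> 16); }
uint8_t argb_green(uint32_t colour) { return (uint8_t)(colour >> 8); }
uint8_t argb_blue(uint32_t colour)  { return (uint8_t)colour; }

/* Return colour with only the red channel replaced. */
uint32_t argb_with_red(uint32_t colour, uint8_t r)
{
    return (colour & ~UINT32_C(0x00FF0000)) | ((uint32_t)r << 16);
}
```

The pixel still occupies four bytes, so nothing is lost in storage compared with the union.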

Edit: I was expecting AProgrammer to write down an answer to vote and close this one. As some comments have pointed out, endianness is dealt with in other parts of the standard by letting each implementation decide what to do, and alignment and padding can also be handled differently. Now, the strict aliasing rules that AProgrammer implicitly refers to are an important point here. The compiler is allowed to make assumptions on the modification (or lack of modification) of variables. In the case of the union, the compiler could reorder instructions and move the read of each color component over the write to the colour variable.

David Rodríguez - dribeas
  • +1 for the clear and simple reply! I agree, for portability, the method you've given in the 2nd para holds good; but can I use the way I've put up in the question, if my code is tied down to a single architecture (paying the price of portability), since it saves 4 bytes for each pixel value and some time saved in running that function? – legends2k Feb 22 '10 at 11:39
  • The endian issue doesn't force the standard to declare it as undefined behaviour - reinterpret_cast has exactly the same endian issues, but has implementation defined behaviour. – JoeG Feb 22 '10 at 11:42
  • 2
    @legends2k, the problem is that the optimizer may assume that a uint32_t is not modified by writing to a uint8_t, and so you get the wrong value when the optimizer uses that assumption... @Joe, the undefined behavior appears as soon as you access the pointer (I know, there are some exceptions). – AProgrammer Feb 22 '10 at 13:31
  • @AProgrammer: So without hitting undefined behaviour (unions, reinterpret_cast, type-punning...) I cannot do any bit-level manipulations on an embedded machine, is that right? On my platform memory is at a premium and I cannot afford to allocate like 8 bytes for a 32 bit pixel colour value. – legends2k Feb 22 '10 at 13:49
  • @legends2k, there are some exceptions. I seem to remember that a cast (reinterpret_cast in C++) to a char type is one of them. I don't remember if uint8_t is guaranteed to be a char type or not. I don't remember a similar exception for unions (there is an exception for unions of structs starting with members of the same types, but that is quite different). Depending on your context, you can check whether masking and shifting the uint32_t with some (inline) accessor members is all you need. – AProgrammer Feb 22 '10 at 14:06
  • 1
    @legends2k/AProgrammer: The result of a reinterpret_cast is implementation defined. Using the pointer returned does not result in undefined behaviour, only in implementation defined behaviour. In other words, the behaviour must be consistent and defined, but it isn't portable. – JoeG Feb 22 '10 at 16:46
  • 1
    @legends2k: any decent optimizer will recognize bitwise operations that select an entire byte and generate code to read/write the byte, same as the union but well-defined (and portable). e.g. uint8_t getRed() const { return colour & 0x000000FF; } void setRed(uint8_t r) { colour = (colour & ~0x000000FF) | r; } – Ben Voigt Feb 22 '10 at 22:13
  • What Ben Voigt said, with the addition that you can mark those functions as `inline` (or define them as macros), which should allow the optimiser to produce similar or identical code to the `union` construct. – caf Feb 22 '10 at 22:32
  • @AProgrammer/Ben Voigt/caf: Lesson learnt; I'll avoid this type of usage with unions altogether and resort to masking and shifting, it's portable by all means :) – legends2k Feb 22 '10 at 23:01
  • @JoeGauterin "_The result of a reinterpret_cast is implementation defined._" Right. "_Using the pointer returned does not result in undefined behaviour, only in implementation defined behaviour._" Wrong. This behaviour is not defined by the standard, and does not have to be defined by the implementation. Or tell us the definition! – curiousguy Oct 03 '11 at 17:19
  • 1
    @curiousguy: The Standard specifies that if sizeof (sometype) reports N, then converting a pointer to that type to char* and reading N values will yield some (not necessarily unique) sequence of unsigned char values. It also specifies that overwriting an object with such a sequence of char values will set its value to the value that would have yielded that sequence. The Standard could have specified that a union must behave as though it holds a sequence of unsigned char values, and the effects of reading and writing would be defined in terms of the effects on those values. – supercat Jun 28 '16 at 00:20
  • @curiousguy: I don't know that there are any implementations that define contrary behavior (as opposed to not specifying any), so an implementation which always behaved in that fashion should be able to satisfy all behavioral expectations. – supercat Jun 28 '16 at 00:21
  • If what you say in the first paragraph were much of a problem, then C wouldn't allow type punning via unions. But in C99 it is [explicitly allowed](http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm). – Ruslan Oct 10 '18 at 10:16
  • @Ruslan: read the link carefully, it explicitly states in C89 that it is implementation defined, the same is implicitly stated in C99 with *reinterpreted as an object representation* --that reinterpretation is implementation defined. C is a lower level language than C++, and C++ makes that undefined behaviour (even though some compilers --gcc-- explicitly document and support it). The concerns in the first paragraph are there in C89/99/11 – David Rodríguez - dribeas Oct 19 '18 at 20:01
  • Reinterpretation as an object representation is much more specific than simply implementation-defined. But in any case, my point is that since it was not a problem to standardize the _behavior_ for the C standards committee, the differences between CPUs in memory alignment requirements etc. don't automatically call for undefined behavior. It's purely the problem of actually doing the definition (at least for POD types), albeit making it implementation-defined. C++ standards committee apparently didn't find much need in this, unlike the C committee, so we have UB in C++ here. – Ruslan Oct 19 '18 at 22:02
  • @Ruslan: I am not sure if there's maybe a misconception on what undefined behaviour means. It does not mean "will fail" or "cannot be standardised", it means that the language chooses not to standardise it. In many cases this is driven by cannot be standardised across platforms, so the committee decides whether it should be implementation defined (the implementation must choose one behaviour) or undefined (the implementation does not need to make the choice, two instances in the same environment/program may do different things). – David Rodríguez - dribeas Oct 29 '18 at 10:06
  • Whatever it means, it, unlike an implementation-defined result, implies that once it happens (even before it does, if it'll decidedly happen), your program doesn't have any semantics whatsoever (as per the Standard). An implementation-defined result lets your program have normal semantics, parametrized only by the result it got. – Ruslan Oct 29 '18 at 10:18
  • @Ruslan: The choice in C means that a compiler cannot reorder accesses through pointers to `T` with accesses to members of a union `U` if any of the members of `U` is or contains a `T`. The choice of allowing that in the language means that the strict aliasing rules are less strict and that you can (potentially) hurt the optimiser. Implementations can still as an extension (xlC and gcc do, probably clang too for compatibility with gcc) define the behaviour. It only means that the standard makes no promise. – David Rodríguez - dribeas Oct 29 '18 at 10:33
  • @DavidRodríguez-dribeas: According to Footnote 88 of the C11 draft, the purpose of the "strict aliasing rules" is to say when things are allowed to alias. The fact that `T` happens to exist within a union type that also contains `U` should not obligate a compiler to assume that a `T*` of unknown provenance might access the same storage as a `U*` of unknown provenance, but in cases where a `T*` is freshly visibly derived from a `U*` a quality compiler should recognize the possibility that the `T*` might identify the storage associated with a `U*`. The authors of C89 no doubt... – supercat Feb 11 '19 at 18:43
  • ...thought that so obvious that it could go without saying, but unfortunately some people assume that anything that isn't in the Standard has never been a "real" part of C. – supercat Feb 11 '19 at 18:46
  • Hmm... the optimisation issue you mentioned in your first reply here could probably be solved by requiring compilers to evaluate union accesses at object level instead of at member level when determining proper ordering, @AProgrammer. For the OP's `union ARGB { /*...*/ } pixel`, the optimiser might not be able to determine whether modifying the `uint32_t` would affect any of the `uint8_t`s, but it _can_ determine that modifying one of `pixel`'s members modifies `pixel` as a whole, and could be required to base assumptions on that. ...This could definitely result in less optimal code, though. – Justin Time - Reinstate Monica Jun 22 '22 at 21:50
  • @JustinTime-ReinstateMonica: The intention of the "strict aliasing rule" was to allow compilers to make optimizations that wouldn't interfere with what programmers needed to do; the authors expected compiler vendors would know more about what their customers would need to do than the Committee ever could, and thus *waived jurisdiction* over the legitimacy of optimizations that would interfere with useful but non-portable programs. Given that most code which accesses storage with different types at different times does so using one of two common patterns, and most situations where... – supercat Feb 13 '23 at 18:09
  • ...type-based optimizations could offer significant benefits don't fit either one, there's no reason why quality implementations intended to be suitable for low-level programming shouldn't support the patterns: (derive reference to T1 from a reference to T2; use the reference to T1; abandon that reference), or (derive a reference to T1 from a reference to T2, and abandon all references that will be directly used to access the storage as type T2). The authors of the Standard never imagined anyone writing a "general-purpose" compiler would refuse to handle such patterns. – supercat Feb 13 '23 at 18:14
  • That makes sense, @supercat. I was looking at how it would be possible to determine that the union (or any UDT in general) has been modified, on the grounds that "if one field is modified, the containing type is modified", and extrapolating that compilers could in turn extend that logic so that modifying any field of a union implicitly modifies every field of the union. I wasn't aware that making this assumption would directly contradict the intents and purposes of the strict aliasing rule, though it does make sense when you put it that way. Thanks for the explanation. – Justin Time - Reinstate Monica Jun 08 '23 at 21:08
  • @JustinTime-ReinstateMonica: IMHO, the root problem lies with a corollary of the "as-if" rule: the only way to allow an optimizing transform is to characterize as UB any action that would allow its effects to be observed. Saying that if/unless a program contains certain directives, a compiler may perform certain transforms when certain conditions apply, *even if doing so would affect certain corner-case behaviors*, would allow more useful transforms to be applied more easily and safely than relying upon "anything can happen" UB, especially in scenarios where... – supercat Jun 08 '23 at 21:21
  • ...a transform would cause a program that would have behaved in one fashion to behave in a manner which is inconsistent with the source code as written *but which would still meet application requirements*. For example, many programs use `fwrite` to output structures that contain a mixture of useful and useless bytes; an optimizing transform that causes the useless bytes to hold values inconsistent with sequential program execution may yield a useful program which is more efficient than any program whose behavior was consistent with sequential execution could be. – supercat Jun 08 '23 at 21:30
30

The most common use of union I regularly come across is aliasing.

Consider the following:

union Vector3f
{
  struct{ float x,y,z ; } ;
  float elts[3];
};

What does this do? It allows clean, neat access to the members of a `Vector3f vec;` either by name:

vec.x=vec.y=vec.z=1.f ;

or by integer access into the array

for( int i = 0 ; i < 3 ; i++ )
  vec.elts[i]=1.f;

In some cases, accessing by name is the clearest thing you can do. In other cases, especially when the axis is chosen programmatically, the easier thing to do is to access the axis by numerical index - 0 for x, 1 for y, and 2 for z.
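
A hedged sketch of the same idea in C11, where anonymous structs are part of the language and reading a union member other than the one last written is permitted. The `_Static_assert` lines are an addition, not part of the original answer: they stop compilation if padding ever breaks the assumed correspondence between `x`, `y`, `z` and `elts[0..2]`. `vector3f_axis` is a hypothetical helper. In C++ the anonymous struct is a compiler extension and reading the inactive member remains formally undefined, so this sketch applies to C only.

```c
#include <assert.h>
#include <stddef.h>

union Vector3f
{
    struct { float x, y, z; };   /* anonymous struct: standard since C11 */
    float elts[3];
};

/* Refuse to compile if the named members do not line up with the array. */
_Static_assert(offsetof(union Vector3f, y) == 1 * sizeof(float),
               "y must alias elts[1]");
_Static_assert(offsetof(union Vector3f, z) == 2 * sizeof(float),
               "z must alias elts[2]");
_Static_assert(sizeof(union Vector3f) == 3 * sizeof(float),
               "no trailing padding expected");

/* Hypothetical helper: pick a component by index (0 = x, 1 = y, 2 = z). */
float vector3f_axis(const union Vector3f *v, int axis)
{
    assert(v != NULL && axis >= 0 && axis < 3);
    return v->elts[axis];
}
```

With the layout checked at compile time, choosing an axis programmatically, for example `vector3f_axis(&vec, i)` inside a loop over `i`, reads the same storage as `vec.x`, `vec.y` and `vec.z`.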

bobobobo
  • 3
    This is also called `type-punning` which is also mentioned in the question. Also the example in the question shows a similar example. – legends2k Aug 11 '13 at 22:56
  • 7
    It's not type punning. In my example the types _match_, so there is no "pun", it's merely aliasing. – bobobobo Aug 11 '13 at 23:51
  • 4
    Yes, but still, from an absolute viewpoint of the language standard, the member written to and read from are different, which is undefined as mentioned in the question. – legends2k Aug 12 '13 at 12:49
  • 6
    I would hope that a future standard would fix this particular case to be allowed under the "common initial subsequence" rule. However, arrays do not participate in that rule under the current wording. – Ben Voigt Jul 24 '14 at 02:11
  • @legends2k The language specification is so bad that we cannot say that this is well defined or not. – curiousguy Oct 14 '15 at 00:09
  • 3
    @curiousguy: There is clearly no requirement that the structure members be placed without arbitrary padding. If code tests for structure-member placement or structure size, code should work if accesses are done directly through the union, but a strict reading of the Standard would indicate that taking the address of a union or struct member yields a pointer which cannot be used as a pointer of its own type, but must first be converted back to a pointer to the enclosing type or a character type. Any remotely-workable compiler will extend the language by making more things work than... – supercat Jun 28 '16 at 00:14
  • ...the standard requires, and any halfway-decent compiler will extend it somewhat beyond that, but unfortunately people are more interested in writing compilers that will run a few programs fast than that will allow efficiently-written programs to work. – supercat Jun 28 '16 at 00:15
  • 1
    @legends2k: This is fully defined behavior now. The struct and the array have the same padding. – Joshua Jun 18 '19 at 19:51
  • @Joshua Thanks for chipping in. Which is fully defined now, according to which language standard? Please let us know. – legends2k Jun 19 '19 at 10:39
  • So why can we be sure that elts[0]==x? We can't; it's platform-dependent behavior. – Qbik Jul 07 '19 at 20:31
  • @bobobobo: It is generally only "aliasing" in the sense that proponents of the "Strict aliasing rule" abuse the term to justify nonsensical behavior. In non-language-abusing terminology, two references alias if one is used to access the same storage within the other's active lifetime, *neither reference was created from the other*, and at least one of the accesses is a write. Given `u.m1 = 2; x=u.m2;`, the access to `u.m1` would create a temporary reference to `m1`, and `u.m2` would create a temporary reference `m2`, but the last use of the first reference would precede the creation... – supercat Feb 13 '23 at 17:57
  • ...of the second. Under non-abusive terminology usage, two references whose lifetimes do not overlap, do not alias. Had the code been written `int *p = &u.m1; float *q = &u.m2; *p = 2; x =*q;`, then `*p` and `*q` would alias since the reference encapsulated by `*p` would be used to modify storage between the creation of the reference encapsulated by `*q` and the last use of that reference to modify the associated storage. – supercat Feb 13 '23 at 18:01
  • @supercat I'm not sure where you got that definition of _alias_. To me, an alias just means "another name for something". In the context of C++, an alias is another name for a variable (eg `int x = 1; int &y = x;`), or a type (eg `using Integer = int;`). Here, `x`, `y`, `z` & `elts`'s lifetimes overlap in any instance of a `Vector3f`. – bobobobo Jun 11 '23 at 02:13
  • @bobobobo: If storage will not be modified within the lifetime of a reference, then nothing would need to know or care about what other references might be used to also read the storage. If a reference is used to create another reference, but won't be used to access storage or create another reference within the lifetime of the derived reference, semantics would be equivalent to having the derived reference being the only reference within its lifetime, after which the reference from which it was derived would go back to being the only reference. – supercat Jun 11 '23 at 19:23
  • @supercat yes, but, whether or not you use a variable doesn't change what type of variable it is – bobobobo Jun 12 '23 at 05:26
  • @bobobobo: Aliasing has almost nothing to do with types, except that a compiler may assume that objects of different types won't alias. The act of forming a temporary reference, using it, and abandoning it, without any intervening use of the original reference, however, is an *expected* pattern which would only be called "aliasing" by people who want to justify the willful blindness of clang and gcc toward such constructs. Making N1570 6.5p7 handle the vast majority of cases that presently require `-fno-strict-aliasing` would merely require saying that storage *which is used...* – supercat Jun 12 '23 at 15:25
  • *...as a particular type T within some context*... may only be accessed via an lvalue which is, *within that context, freshly visibly derived from* an lvalue of type T. A reference is "freshly" derived from another until the original reference gets used again, and the "context" in the first clause may be drawn broadly or narrowly, at the compiler's leisure, provided that the "context" in the second is drawn at least as broadly. Informally, the compiler should look for derived references as hard as it looks for opportunities to exploit their absence. – supercat Jun 12 '23 at 15:30
10

As you say, this is strictly undefined behaviour, though it will "work" on many platforms. The real reason for using unions is to create variant records.

union A {
   int i;
   double d;
};

A a[10];    // records in "a" can be either ints or doubles 
a[0].i = 42;
a[1].d = 1.23;

Of course, you also need some sort of discriminator to say what the variant actually contains. And note that in C++03 unions are not much use because they can only contain POD types - effectively those without constructors and destructors. (C++11 lifted this restriction, at the cost of having to construct and destroy such members by hand.)
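A minimal sketch of such a discriminator, assuming a hypothetical Record type and helper names that are not part of this answer. The tag is written together with the value, and readers consult the tag before touching the union, so only the active member is ever read:

```cpp
#include <cassert>

// Hypothetical record type: the tag records which union member is active.
struct Record {
    enum class Kind { Int, Double } kind;
    union {
        int i;
        double d;
    };  // anonymous union: members are accessed as r.i / r.d
};

inline Record make_int(int v) {
    Record r;
    r.kind = Record::Kind::Int;
    r.i = v;  // i becomes the active member
    return r;
}

inline Record make_double(double v) {
    Record r;
    r.kind = Record::Kind::Double;
    r.d = v;  // d becomes the active member
    return r;
}

// Reads only the member that was last written, as selected by the tag.
inline double as_double(const Record& r) {
    switch (r.kind) {
        case Record::Kind::Int:    return static_cast<double>(r.i);
        case Record::Kind::Double: return r.d;
    }
    return 0.0;  // not reached for valid tags
}
```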

  • Have you used it thus (like in the question)?? :) – legends2k Feb 22 '10 at 11:26
  • It's a bit pedantic, but I don't quite accept "variant records". That is, I'm sure they were in mind, but if they were a priority why not provide them? "Provide the building block because it might be useful to build other things as well" just seems intuitively more likely. Especially given at least one more application that was probably in mind - memory mapped I/O registers, where the input and output registers (while overlapped) are distinct entities with their own names, types etc. –  Feb 22 '10 at 11:45
  • @Stev314 If that was the use they had in mind, they could have made it not be undefined behaviour. –  Feb 22 '10 at 11:49
  • @Neil: +1 for the first to say about the actual usage without hitting undefined behaviour. I guess they could have made it implementation defined like other type punning operations (reinterpret_cast, etc.). But like I asked, have you used it for type-punning? – legends2k Feb 22 '10 at 12:31
  • @Neil - the memory-mapped register example isn't undefined, the usual endian/etc aside and given a "volatile" flag. Writing to an address in this model doesn't reference the same register as reading the same address. Therefore there is no "what are you reading back" issue as you're not reading back - whatever output you wrote to that address, when you read you're just reading an independent input. The only issue is making sure you read the input side of the union and write the output side. Was common in embedded stuff - probably still is. –  Feb 22 '10 at 12:56
  • @legends2k I don't use it because it doesn't really work in C++ for the reason I gave, and because I think it's normally bad design to use variants of any sort. –  Feb 22 '10 at 13:12
8

In C it was a nice way to implement something like a variant.

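#include <stdio.h>

enum possibleTypes {
  eInt,
  eDouble,
  eChar
};

struct Value {
    union {                 /* only one member is stored at a time */
      int iVal_;
      double dval;
      char cVal;
    } value_;
    enum possibleTypes discriminator_;   /* says which member is valid */
};

void print(struct Value val)
{
  switch(val.discriminator_)
  {
    case eInt:    printf("%d\n", val.value_.iVal_); break;
    case eDouble: printf("%f\n", val.value_.dval);  break;
    case eChar:   printf("%c\n", val.value_.cVal);  break;
  }
}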

In times of little memory, this structure uses less memory than a struct that has all the members.
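The saving can be checked at compile time. A small sketch, assuming two hypothetical types with the same three members (the exact sizes depend on the platform, so only the relationship is asserted, and it holds on typical implementations):

```cpp
#include <cassert>
#include <cstddef>

union  ValueUnion  { int i; double d; char c; };
struct ValueStruct { int i; double d; char c; };

// A union only needs room for its largest member (plus any padding),
// while a struct must hold every member at once.
static_assert(sizeof(ValueUnion) >= sizeof(double),
              "the union must be able to hold its largest member");
static_assert(sizeof(ValueUnion) < sizeof(ValueStruct),
              "the union is expected to be smaller than the struct");
```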

By the way, C provides bit-fields such as

    typedef struct {
      unsigned int mantissa_low:32;      //mantissa
      unsigned int mantissa_high:20;
      unsigned int exponent:11;         //exponent
      unsigned int sign:1;
    } realVal;

to access bit values.

Totonga
  • Although both your examples are perfectly defined in the standard; but, hey, using bit fields is sure shot unportable code, isn't it? – legends2k Feb 22 '10 at 12:28
  • 1
    No it isn't. As far as I know it's widely supported. – Totonga Feb 22 '10 at 15:57
  • 1
    Compiler support doesn't translate into portable. [The C Book](http://publications.gbdirect.co.uk/c_book/chapter6/bitfields.html): _C_ (thereby C++) _gives no guarantee of the ordering of fields within machine words, so if you do use them for the latter reason, your program will not only be non-portable, it will be compiler-dependent too._ – legends2k May 29 '14 at 15:42
5

Although this is strictly undefined behaviour, in practice it will work with pretty much any compiler. It is such a widely used paradigm that any self-respecting compiler will need to do "the right thing" in cases such as this. It's certainly to be preferred over type-punning through pointer casts, which may well generate broken code with some compilers.

Paul R
  • 2
    Isn't there an endian issue? A relatively easy fix compared with "undefined", but worth taking into account for some projects if so. –  Feb 22 '10 at 11:26
5

In C++, Boost.Variant implements a safe version of the union, designed to prevent undefined behavior as much as possible.

Its performance is identical to that of the enum + union construct (it is stack-allocated too, etc.), but it uses a template list of types instead of the enum :)
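A minimal sketch of the same idea, using std::variant, the C++17 standard-library counterpart of Boost.Variant. It is swapped in here only so that the example needs no external library; the Value alias and describe function are hypothetical names:

```cpp
#include <cassert>
#include <string>
#include <variant>

// The template argument list plays the role of the enum + union pair.
using Value = std::variant<int, double, std::string>;

// Returns a short description of whichever alternative is active.
inline std::string describe(const Value& v) {
    if (std::holds_alternative<int>(v))    return "int";
    if (std::holds_alternative<double>(v)) return "double";
    return "string";
}
```

Asking for the wrong alternative is checked rather than undefined: std::get throws std::bad_variant_access, and std::get_if returns a null pointer.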

Matthieu M.
5

The behaviour may be undefined, but that just means there isn't a "standard". All decent compilers offer #pragmas to control packing and alignment, but may have different defaults. The defaults will also change depending on the optimisation settings used.

Also, unions are not just for saving space. They can help modern compilers with type punning. If you reinterpret_cast<> everything the compiler can't make assumptions about what you are doing. It may have to throw away what it knows about your type and start again (forcing a write back to memory, which is very inefficient these days compared to CPU clock speed).

Nick
4

Technically it's undefined, but in reality most (all?) compilers treat it exactly the same as using a reinterpret_cast from one type to the other, the result of which is implementation defined. I wouldn't lose sleep over your current code.

JoeG
  • "_a reinterpret_cast from one type to the other, the result of which is implementation defined._" No, it is not. Implementations do not have to define it, and most do not define it. Also, what would be the allowed implementation defined behaviour of casting some random value to a pointer? – curiousguy Oct 03 '11 at 17:35
4

Others have mentioned the architecture differences (little - big endian).

I read the problem as this: since the memory for the variables is shared, writing to one changes the others and, depending on their type, the resulting value could be meaningless.

eg. union{ float f; int i; } x;

Writing to x.i would be meaningless if you then read from x.f - unless that is what you intended in order to look at the sign, exponent or mantissa components of the float.
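If inspecting the bits of a float is the goal, copying the bytes with memcpy does it without reading an inactive union member. A minimal sketch, assuming a 32-bit float in IEEE-754 binary32 format (the helper names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Copies the object representation of f into an integer of the same size.
inline std::uint32_t float_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// The field positions below assume the IEEE-754 binary32 layout:
// 1 sign bit, 8 exponent bits, 23 mantissa bits.
inline std::uint32_t sign_of(float f)     { return float_bits(f) >> 31; }
inline std::uint32_t exponent_of(float f) { return (float_bits(f) >> 23) & 0xFFu; }
inline std::uint32_t mantissa_of(float f) { return float_bits(f) & 0x7FFFFFu; }
```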

I think there is also an issue of alignment: If some variables must be word aligned then you might not get the expected result.

eg. union{ char c[4]; int i; } x;

If, hypothetically, on some machine a char had to be word aligned then c[0] and c[1] would share storage with i but not c[2] and c[3].
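For what the language does guarantee here, a small sketch (the union name and helper are hypothetical): every member of a union starts at the union's own address, and the union is aligned for its strictest member. How many elements of c overlap i therefore depends only on sizeof(int):

```cpp
#include <cassert>

union CharsAndInt {
    char c[4];
    int  i;
};

// The union is aligned for its strictest member and large enough for either one.
static_assert(alignof(CharsAndInt) >= alignof(int), "aligned for int");
static_assert(sizeof(CharsAndInt) >= sizeof(int) && sizeof(CharsAndInt) >= 4,
              "large enough for either member");

// Taking addresses does not read any member, so this is fine
// regardless of which member is active.
inline bool members_share_address(const CharsAndInt& u) {
    const void* base = &u;
    return base == static_cast<const void*>(&u.c[0])
        && base == static_cast<const void*>(&u.i);
}
```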

philcolbourn
  • A byte that has to be word aligned? That makes no sense. A **byte** has no alignment requirement, by definition. – curiousguy Oct 03 '11 at 17:54
  • Yes, I probably should have used a better example. Thanks. – philcolbourn May 11 '12 at 11:28
  • @curiousguy: There are many cases where one may wish to have arrays of bytes be word-aligned. If one has many arrays of e.g. 1024 bytes and will frequently wish to copy one to another, having them word aligned may on many systems double the speed of a `memcpy()` from one to another. Some systems might speculatively align `char[]` allocations *that occur outside of structures/unions* for that and other reasons. In the extant example, the assumption that `i` will overlap all four elements of `c[]` is non-portable, but that's because there's no guarantee that `sizeof(int)==4`. – supercat Mar 10 '15 at 16:40
4

For one more example of the actual use of unions, the CORBA framework serializes objects using the tagged union approach. All user-defined classes are members of one (huge) union, and an integer identifier tells the demarshaller how to interpret the union.

Cubbi
4

In the C language as it was documented in 1974, all structure members shared a common namespace, and the meaning of "ptr->member" was defined as adding the member's displacement to "ptr" and accessing the resulting address using the member's type. This design made it possible to use the same ptr with member names taken from different structure definitions but with the same offset; programmers used that ability for a variety of purposes.

When structure members were assigned their own namespaces, it became impossible to declare two structure members with the same displacement. Adding unions to the language made it possible to achieve the same semantics that had been available in earlier versions of the language (though the inability to have names exported to an enclosing context may have still necessitated using a find/replace to replace foo->member into foo->type1.member). What was important was not so much that the people who added unions have any particular target usage in mind, but rather that they provide a means by which programmers who had relied upon the earlier semantics, for whatever purpose, should still be able to achieve the same semantics even if they had to use a different syntax to do it.

supercat
  • Appreciate the history lesson. However, with the standard defining such and such as undefined, which wasn't the case in the bygone C era when the K&R book was the only "standard", one has to be careful not to use it _for whatever purpose_ and end up in UB land. – legends2k Sep 22 '16 at 09:41
  • 2
    @legends2k: When the Standard was written, the majority of C implementations treated unions the same way, and such treatment was useful. A few, however, did not, and the authors of the Standard were loath to brand any existing implementations as "non-conforming". Instead, they figured that if implementers didn't need the Standard to tell them to do something (as evidenced by the fact that they were *already doing it*), leaving it unspecified or undefined would simply preserve the *status quo*. The notion that it should make things less defined than they were before the Standard was written... – supercat Sep 22 '16 at 14:28
  • 2
    ...seems a much more recent innovation. What's particularly sad about all of this is that if compiler writers targeting high-end applications were to figure out how to add useful optimization directives to the language most compilers implemented in the 1990s, rather than gutting features and guarantees that had been supported by "only" 90% of implementations, the result would be a language which could perform better and more reliably than hyper-modern C. – supercat Sep 22 '16 at 14:32
3

As others mentioned, unions combined with enumerations and wrapped into structs can be used to implement tagged unions. One practical use is to implement Rust's Result<T, E>, which Rust itself implements as a plain enum (Rust enum variants can carry additional data). Here is a C++ example:

#include <cstdint>

// Sketch: assumes T and E are distinct, trivially constructible types.
// For non-trivial types the union members would need placement new and
// explicit destructor calls.
template <typename T, typename E> struct Result {
    public:
    enum class Success : uint8_t { Ok, Err };
    Result(T val) {
        m_success = Success::Ok;
        m_value.ok = val;
    }
    Result(E val) {
        m_success = Success::Err;
        m_value.err = val;
    }
    // Note: compares only the Ok/Err state, not the stored values.
    inline bool operator==(const Result& other) const {
        return other.m_success == this->m_success;
    }
    inline bool operator!=(const Result& other) const {
        return other.m_success != this->m_success;
    }
    inline T expect(const char* errorMsg) const {
        if (m_success == Success::Err) throw errorMsg;
        else return m_value.ok;
    }
    inline bool is_ok() const {
        return m_success == Success::Ok;
    }
    inline bool is_err() const {
        return m_success == Success::Err;
    }
    // Pointer to the stored value, or nullptr if the other member is active.
    inline const T* ok() const {
        if (is_ok()) return &m_value.ok;
        else return nullptr;
    }
    inline const E* err() const {
        if (is_err()) return &m_value.err;
        else return nullptr;
    }

    // Other methods from https://doc.rust-lang.org/std/result/enum.Result.html

    private:
    Success m_success;
    union _val_t { T ok; E err; } m_value;
};
Kotauskas
1

You can use a union for two main reasons:

  1. A handy way to access the same data in different ways, like in your example
  2. A way to save space when there are different data members of which only one can ever be 'active'

1. is really more of a C-style hack: a shortcut that relies on knowing how the target system's memory architecture works. As already said, you can normally get away with it if you don't actually target lots of different platforms. I believe some compilers might also let you use packing directives (I know they do on structs)?

A good example of 2. can be found in the VARIANT type used extensively in COM.

Mr. Boy
1

@bobobobo's code is correct, as @Joshua pointed out (sadly I'm not allowed to add comments, so I'm doing it here; IMO it was a bad decision to disallow that in the first place):

https://en.cppreference.com/w/cpp/language/data_members#Standard_layout says that it is fine to do so, at least since C++14:

In a standard-layout union with an active member of non-union class type T1, it is permitted to read a non-static data member m of another union member of non-union class type T2 provided m is part of the common initial sequence of T1 and T2 (except that reading a volatile member through non-volatile glvalue is undefined).

since in the current case T1 and T2 denote the same type anyway.
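A minimal sketch of the rule quoted above, using hypothetical struct names. Both structs are standard-layout and start with an int, so that int is their common initial sequence and may be read through either union member. Note that the quoted wording speaks of members of class type; as Ben Voigt's comment above points out, an array member such as elts is not covered by it.

```cpp
#include <cassert>

struct Circle { int kind; float radius; };
struct Rect   { int kind; double width; double height; };

union Shape {
    Circle circle;
    Rect   rect;
};

// Reads `kind` through `rect` even when `circle` is the active member.
// This relies on `kind` being in the common initial sequence of both structs;
// reading `rect.width` in the same situation would not be covered.
inline int kind_of(const Shape& s) {
    return s.rect.kind;
}
```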

rob