4

In the comments of this answer it is said that splitting an integer into its bytes using a union, as follows, would be undefined behavior. The code given there is similar though not identical to this; please let me know if I have changed any aspects of the code that are relevant to undefined behavior.

union addr {
 uint8_t addr8[4];
 uint32_t addr32;
};

Up to now I thought this would be a fine approach to do things like addr = {127, 0, 0, 1}; and get the corresponding uint32_t in return. (I acknowledge that this may yield different results depending on the endianness of my system. The question however remains.)

Is this undefined behavior? If so, why? (I don't know what "What's UB in C++ is to access inactive union members" means.)


C99

  • C99 is apparently pretty close to C++03 on this point.

C++03

  • In a union, at most one of the data members can be active at any time, that is, the value of at most one of the data members can be stored in a union at any time. C++03, Section 9.5 (1), page 162

However

  • If a POD-union contains several POD-structs that share a common initial sequence [...] it is permitted to inspect the common initial sequence of any of POD-struct members ibid.
  • Two POD-struct [...] types are layout-compatible if they have the same number of nonstatic data members, and corresponding nonstatic data members (in order) have layout-compatible types C++03, Section 9.2 (14), page 157
  • If two types T1 and T2 are the same type, then T1 and T2 are layout-compatible types. C++03, Section 3.9 (11), page 53

Conclusion

  • Since uint8_t[4] and uint32_t are not the same type (a strict aliasing matter, I guess), and neither is a POD-struct or union, is the above indeed UB?

C++11

  • Note that aggregate type does not include union type because an object with union type can only contain one member at a time. C++11, Footnote 46, page 42
moooeeeep
  • FWIW, I have litb on record [saying this is not UB](http://chat.stackoverflow.com/transcript/10?m=3138021#3138021). – R. Martinho Fernandes Apr 22 '12 at 20:46
  • Technically, it is UB; if you read out of a union member other than the one that was last written to, you technically get UB. However, unless compiler writers are deliberately trying to make life impossible, it will work OK, with a major caveat about endian-ness in the context of 127.0.0.1 (which looks suspiciously like a big-endian localhost loopback address in IPv4). – Jonathan Leffler Apr 22 '12 at 20:47
  • @JonathanLeffler Acknowledged. Please assume these variables as being anonymized! – moooeeeep Apr 22 '12 at 22:24
  • @R.MartinhoFernandes: That does not mean that it is not UB. In a union, there is at most one *active* member at a time, but the standard determines that it is valid in some specific case to inspect the *shared initial sequence* of a different member of the union that shares such an initial sequence with the active member. Whether a char array *shares the initial sequence* of an `int` is something to be discussed. – David Rodríguez - dribeas Apr 23 '12 at 01:36
  • @David: Sure, but normally you can access the storage of an int through a char reference. What's so different here? If I grab a reference to the char member, it's ok, but it is not if I do it directly? Put simply, I don't buy this "active member" argument. I'll be happy if someone could provide a definitive quote that says that reading from a non-active member is UB, because I can't. – R. Martinho Fernandes Apr 23 '12 at 03:26

4 Answers

10

I don't know what "What's UB in C++ is to access inactive union members" means.

Basically what it means is that the only member you can read from a union without invoking undefined behavior is the last written one. In other words, if you write to addr32, you can only read from addr32, not addr8 and vice versa.
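
As a rough illustration of that rule, here is a minimal sketch using the union from the question. The helper names `write_then_read_same` and `split_bytes` are hypothetical, and the `std::memcpy` route is a commonly used alternative for getting at the bytes without reading an inactive member; it is not something proposed in the question itself.

```cpp
#include <cstdint>
#include <cstring>

union addr {
    std::uint8_t addr8[4];
    std::uint32_t addr32;
};

// Writes addr32 and reads the same member back: the member that was
// written last is the one being read, so the active-member rule is respected.
std::uint32_t write_then_read_same(std::uint32_t value) {
    addr a;
    a.addr32 = value;   // addr32 becomes the active member
    return a.addr32;    // reading the active member
}

// Hypothetical helper: copies the object representation of a 32-bit value
// into four bytes without going through a union at all.
// The order of the bytes in `out` still depends on the host's endianness.
void split_bytes(std::uint32_t value, std::uint8_t (&out)[4]) {
    std::memcpy(out, &value, sizeof value);
}
```

Reading `a.addr8` right after the write to `a.addr32` is the case the answers here disagree about, so the sketch deliberately avoids it.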

An example is also available here.

Edit: Since there has been much discussion about whether this is UB or not, consider the following (fully valid) C++11 example:

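#include <string>

union olle {
    std::string str;   // non-trivial member: allowed in a union since C++11
    std::wstring wstr; // only one of the two can be active at a time
};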

Here you can definitely see that activating str and reading wstr may be a problem. You could see this as an extreme example since you even have to activate the member by doing a placement new, but the spec actually covers this case with no mention that it's to be considered a special case in other ways regarding active members.
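
To make the "activate by placement new" step concrete, here is a minimal sketch of switching the active member by hand. The user-provided constructor and destructor, the `end_lifetime` helper, and the `switch_active_member` function are assumptions added for illustration; a union with non-trivial members needs a user-provided constructor and destructor before an object of it can be created at all.

```cpp
#include <cstddef>
#include <new>
#include <string>

union olle {
    std::string str;
    std::wstring wstr;
    olle() : str() {}   // assumption: start with str as the active member
    ~olle() {}          // does nothing; the active member is destroyed by hand
};

// Hypothetical helper: ends the lifetime of an object by calling its destructor.
template <typename T>
void end_lifetime(T& object) {
    object.~T();
}

// Switches the active member from str to wstr and returns wstr's length.
std::size_t switch_active_member() {
    olle u;                                // str is active
    u.str = "narrow";
    end_lifetime(u.str);                   // str is no longer alive
    new (&u.wstr) std::wstring(L"wide");   // wstr becomes the active member
    std::size_t length = u.wstr.size();
    end_lifetime(u.wstr);                  // clean up before u goes out of scope
    return length;
}
```

Reading `u.wstr` before the placement new would touch a `std::wstring` that was never constructed, which is the problem the example above is meant to make obvious.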

Joachim Isaksson
  • I don't think it's allowed to put `std::string` into a union: _An object of a class with a non-trivial constructor [...] cannot be a member of a union_ C++03 9.5 (1). Is this different in C++11? – moooeeeep Apr 23 '12 at 07:54
  • @moooeeeep Yes, the C++11 draft 9.5.4 (don't have the final handy) is actually mentioning that exact case; `Example: Consider an object u of a union type U having non-static data members m of type M and n of type N. If M has a non-trivial destructor and N has a non-trivial constructor (for instance, if they declare or inherit virtual functions), the active member of u can be safely switched from m to n using the destructor and placement new operator as follows:...` – Joachim Isaksson Apr 23 '12 at 08:37
8

[edit: read my edited section below, as I'm now unsure of whether this is undefined behavior or not; I'll leave the majority of my answer the same, however, until I can confirm further] Yes, this is undefined behavior. The C++ Standard, section 9.5.1, states:

In a union, at most one of the non-static data members can be active at any time, that is, the value of at most one of the non-static data members can be stored in a union at any time. [ Note: One special guarantee is made in order to simplify the use of unions: If a standard-layout union contains several standard-layout structs that share a common initial sequence (9.2), and if an object of this standard-layout union type contains one of the standard-layout structs, it is permitted to inspect the common initial sequence of any of standard-layout struct members; see 9.2. — end note ]

This means that only the most recently written member can validly be read (reading from the others is technically undefined behavior). Only one member of the union can be active at any time, not two.

You might ask why. Consider your example. C++ does not mandate the endianness of addr32. It could be big-endian, little-endian, or middle-endian. If you write to addr8, and then read from addr32, C++ cannot guarantee you'll get the right value out because of the endianness in this case. On one computer, it could be one value, and on another, it could be a different value. Hence, doing so (that is, writing to one member and reading a different one) is undefined behavior.
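
The endianness point can be sketched without any union. In the following, `pack_big_endian` and `pack_native` are hypothetical helper names: the first builds the 32-bit value with shifts, so the result is the same on every host, while the second copies the bytes as they lie in memory, so the result depends on the host's byte order.

```cpp
#include <cstdint>
#include <cstring>

// Assembles the value arithmetically: b[0] always ends up as the most
// significant byte, regardless of how the host stores integers.
std::uint32_t pack_big_endian(const std::uint8_t (&b)[4]) {
    return (static_cast<std::uint32_t>(b[0]) << 24) |
           (static_cast<std::uint32_t>(b[1]) << 16) |
           (static_cast<std::uint32_t>(b[2]) << 8) |
            static_cast<std::uint32_t>(b[3]);
}

// Copies the four bytes into the integer's storage as-is: which byte
// becomes the most significant one depends on the host's endianness.
std::uint32_t pack_native(const std::uint8_t (&b)[4]) {
    std::uint32_t value;
    std::memcpy(&value, b, sizeof value);
    return value;
}
```

For the bytes {127, 0, 0, 1}, `pack_big_endian` gives 0x7F000001 everywhere, whereas `pack_native` would give 0x7F000001 on a big-endian host and 0x0100007F on a little-endian one.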

Edit: For those wondering what "active" means, the MSDN documentation on Unions states:

The active member of a union is the one whose value was most recently set, and only that member has a valid value.

Edit Edit: I had always thought the behavior of doing this was undefined, but now I'm not so sure after R. Martinho Fernandes's comments and answer and after re-reading the quote from MSDN. The value is certainly unspecified/undefined, but now I'm not so sure if the behavior is (undefined value means you might get different results back; undefined behavior means your system might crash, the two being different things). I'm going to consider this further and talk with others I know to see if I can find a more explicit answer.

I do think it's safe to say, however, that in general reading an inactive member in a union can be undefined behavior (except for the special note in the Standard, of course), but I don't know if it always is (i.e. there may be some exceptions beyond the special note in the section of the C++ Standard I've quoted).

Cornstalks
5

Basically because in C++ you are allowed to access only the active member of a union.

This means that if you set addr8, you should access only addr8 until you set addr32, after which you can access addr32, and so on. Writing one member and then reading another is what causes undefined behavior.

A member is considered active when you set it, and it remains so until another one becomes the active one.

Jack
  • out of curiosity: is this also the case for C or is there no similar notion of an _active member_? – moooeeeep Apr 22 '12 at 21:18
  • @moooeeeep Consensus appears to be that this is defined behavior in C11 (definitely) and C99 (probably). See http://stackoverflow.com/questions/11639947/ – Nemo Apr 18 '13 at 23:15
5

Frankly, I can't find any mention in the standard that doing this is undefined behaviour. The standard does define the notion of "active member" for unions, but it doesn't seem to use that idea for anything other than explaining how to change the active member (§9.5p4), and to define constant expressions (§5.19p2). Specifically, it doesn't seem to make explicit mention of the validity of accessing either the active or the non-active members.

As far as I can see, something like the following can cause a strict aliasing violation, which is undefined behaviour:

union example0 {
    short some_other_view[sizeof(double)/sizeof(short)];
    double value;
};

The strict aliasing violation here does not come from some special rule for unions. It arises whenever you access the same memory location through types that can't be aliased, i.e., it is a "normal" strict aliasing violation.

But, since there's an exception for char when it comes to aliasing rules, the following does not lead to the same kind of violations:

union example1 {
    char byte_view[sizeof(double)];
    double value;
};

As far as I can see, there is nothing in the standard that leaves the following code with undefined behaviour:

example1 e;
e.value = 10.0;
std::cout << e.byte_view[0];
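
The same character-type exception can be sketched without a union. The helper names `bytes_of` and `from_bytes` below are hypothetical; the sketch inspects the bytes of a `double` through an `unsigned char` pointer, which the aliasing rules permit, and rebuilds the value with `std::memcpy`.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Reads the object representation of a double through unsigned char,
// which is allowed to alias any object type.
std::array<unsigned char, sizeof(double)> bytes_of(double value) {
    std::array<unsigned char, sizeof(double)> out{};
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
    for (std::size_t i = 0; i < sizeof(double); ++i) {
        out[i] = p[i];
    }
    return out;
}

// Rebuilds the double from its bytes by copying them back into place.
double from_bytes(const std::array<unsigned char, sizeof(double)>& bytes) {
    double value;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}
```

The individual byte values are unspecified in the sense discussed in the comments (they depend on the floating-point format and byte order of the implementation), but copying them back yields the original value.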
R. Martinho Fernandes
  • `e.byte_view[0]` does not have a well-defined value, in this case, however. Compiling and running the code on one system and then on another may not give the same results. I think it's worth pointing this out. – Cornstalks Apr 22 '12 at 21:54
  • @Cornstalks It has an *unspecified value*. It means the implementation is free to pick the value *but everything else* has to work as written. Unlike *undefined behaviour*, which means the compiler can order pizza. – R. Martinho Fernandes Apr 22 '12 at 21:57
  • I just edited my comment after I walked away and had something click seconds before you replied, and for the most part it agrees with your most recent comment. I think you have made a decent point though, and now I'm reconsidering my thinking. I'll continue searching for a definitive answer. – Cornstalks Apr 22 '12 at 22:04
  • This raises the question whether something is undefined behavior if it is not explicitly defined behavior? – moooeeeep Apr 22 '12 at 22:30
  • @moooeeeep Surely accessing a member is defined (I'm not going to bother finding proof of that: if it's not, it's a defect). The standard just doesn't seem to use the "active member" concept for that, other than in constant expressions. – R. Martinho Fernandes Apr 22 '12 at 22:49
  • @R.MartinhoFernandes: At least in C, accessing a member is only defined if the member happens to have a character type. Whether that's a defect depends upon whether the Standard is "supposed" to fully define everything necessary to make the language useful. Given that the authors explicitly acknowledge that it's possible for a "conforming" implementation to be of such poor quality as to be basically useless, there's no contradiction between "A poor-quality implementation could order pizza whenever a program accesses a union member of non-character type without being non-conforming, and... – supercat Jul 11 '18 at 21:00
  • ..."Any quality implementations that should be viewed as suitable for general-purpose use will process union-member accesses by reinterpreting the appropriate bytes of the structure; implementations that do otherwise are not high quality implementations that should be viewed as suitable for general-purpose use." – supercat Jul 11 '18 at 21:01