The following is not undefined behavior in modern C:

union foo
{
    int i;
    float f;
};
union foo bar;
bar.f = 1.0f;
printf("%08x\n", bar.i);

and prints the hex representation of 1.0f.
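
For reference, here is a self-contained sketch of that first example. It assumes `float` and `unsigned int` have the same size and that `float` uses the IEEE 754 binary32 format, neither of which the standard guarantees. The integer member is made `unsigned` so that it matches `%x`, and the helper names `float_bits` and `print_float_bits` are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: float and unsigned int have the same size; refuse to compile otherwise. */
static_assert(sizeof(float) == sizeof(unsigned int),
              "this sketch assumes float and unsigned int have the same size");

union foo
{
    unsigned int u;   /* unsigned so that %x matches the argument type */
    float f;
};

/* Hypothetical helper: reinterpret the bytes of a float through the union. */
unsigned int float_bits(float value)
{
    union foo bar;
    bar.f = value;    /* store through one member ...               */
    return bar.u;     /* ... read through the other (type punning)  */
}

/* Hypothetical helper: print the bits as eight hex digits. */
void print_float_bits(float value)
{
    printf("%08x\n", float_bits(value));
}
```

Where those assumptions hold, `print_float_bits(1.0f)` prints `3f800000`.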

However the following is undefined behavior:

int x;
printf("%08x\n", x);

What about this?

union xyzzy
{
    char c;
    int i;
};
union xyzzy plugh;

This ought to be undefined behavior since no member of plugh has been written.

printf("%08x\n", plugh.i);

But what about this? Is this undefined behavior or not?

plugh.c = 'A';
printf("%08x\n", plugh.i);

Most C compilers nowadays will have sizeof(char) < sizeof(int), with sizeof(int) being either 2 or 4. That means that in these cases, at most 50% or 25% of plugh.i will have been written to, but reading the remaining bytes will be reading uninitialized data, and hence should be undefined behavior. On the basis of this, is the entire read undefined behavior?

hat
dgnuff
  • Why is `int x; printf("%08x\n", x);` UB? Casting an int to an unsigned int is defined behavior, and not initializing a variable is not UB, so why does this code end up being UB? – Tom's Sep 12 '18 at 08:15
  • @Tom's - Accessing indeterminate values is UB. Point blank. – StoryTeller - Unslander Monica Sep 12 '18 at 08:16
  • @Tom's There are no casts in that line, and why do you think using an uninitialized variable is not UB? – melpomene Sep 12 '18 at 08:16
  • Because the behavior is defined... Well, I agree that this code will end up printing a "garbage/random" value, but it will never crash or behave differently. And there is a "cast", though it's really implicit: printf %x expects an unsigned int, and an int was given? – Tom's Sep 12 '18 at 08:18
  • @Tom's - *"but it will never crash or behave differently"* The C standard, which is the subject here, does not guarantee anything of the sort. That's the whole point in being undefined behavior. – StoryTeller - Unslander Monica Sep 12 '18 at 08:19
  • @Tom's A cast is an explicit type conversion. There's no cast here. There isn't even an implicit conversion because varargs doesn't give you a known type context. Or do you think `printf("%f", 42)` is fine because `42` can be implicitly converted to `double`? – melpomene Sep 12 '18 at 08:21
  • @StoryTeller Strange. I do not see how an uninitialized variable (which is not a pointer) can cause different behavior. Thanks for the info. – Tom's Sep 12 '18 at 08:22
  • I'm not convinced this is a duplicate. The second piece of quoted text in Shafik Yaghmour's answer notes that "... the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type ...". The problem here is that there aren't enough bits in a char to create an int, therefore typically 8 or 24 bits of the int will be uninitialized. As @StoryTeller has noted, uninitialized access is UB. – dgnuff Sep 12 '18 at 08:23
  • @Tom's https://stackoverflow.com/questions/6725809/trap-representation – melpomene Sep 12 '18 at 08:24
  • @melpomene thanks, I will read that. – Tom's Sep 12 '18 at 08:28
  • @dgnuff Perhaps out of common courtesy you might want to accept one of the comprehensive answers below? – Bathsheba Sep 12 '18 at 10:20
  • "The following is not UB in modern C:" Has no one said that the first example is actually completely UB? – Stargateur Sep 12 '18 at 16:01
  • @Stargateur Although in [C the wording has changed a lot](http://blog.frama-c.com/index.php?post/2013/03/13/indeterminate-undefined) reading an indeterminate value is undefined behavior with some caveats. Thankfully for C++, [C++14 nailed it down more concisely](https://stackoverflow.com/q/23415661/1708801) – Shafik Yaghmour Sep 12 '18 at 16:15
  • @ShafikYaghmour Why force the language to define things that should never be written? People are not supposed to use unions like that, and I completely agree. If OP wants to look at the bytes of a float, [this](http://rextester.com/AAIHI96066) is perfectly defined. Plus, there is simply no evidence that an int is the same size as a float. – Stargateur Sep 12 '18 at 16:25
  • @Stargateur type punning via a union is well defined in C, although I would just use memcpy, or bit_cast in C++; see [my answer here for more details](https://stackoverflow.com/a/51228315/1708801). I personally [feel that unions are meant for variant types](https://stackoverflow.com/a/31080901/1708801) but that boat left a long time ago. – Shafik Yaghmour Sep 12 '18 at 16:27
  • @ShafikYaghmour No, it's not! And in any case, don't push people to do this. – Stargateur Sep 12 '18 at 16:31
  • @Stargateur: If a program wants to extract the exponent from a `double`, the level of compiler complexity required to support reading and writing a superimposed `uint16_t` or `uint32_t` would be far less than the level of compiler complexity required to recognize all reasonable code patterns via which a program might assemble bytes of a float into a longer integer type and later decompose that type into a sequence of bytes, and convert those patterns into a single 16-or-32-bit read and a single such write. – supercat Sep 12 '18 at 17:13
  • @Stargateur "Accessing indeterminate values is UB" - wrong – M.M Sep 13 '18 at 00:18

5 Answers

Defect report 283: Accessing a non-current union member ("type punning") covers this and tells us there is undefined behavior if the result is a trap representation.

The defect report asked:

In the paragraph corresponding to 6.5.2.3#5, C89 contained this sentence:

With one exception, if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined.

Associated with that sentence was this footnote:

The "byte orders" for scalar types are invisible to isolated programs that do not indulge in type punning (for example, by assigning to one member of a union and inspecting the storage by accessing another member that is an appropriately sized array of character type), but must be accounted for when conforming to externally imposed storage layouts.

The only corresponding verbiage in C99 is 6.2.6.1#7:

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values, but the value of the union object shall not thereby become a trap representation.

It is not perfectly clear that the C99 words have the same implications as the C89 words.

The defect report added the following footnote:

Attach a new footnote 78a to the words "named member" in 6.5.2.3#3:

78a If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.

C11 6.2.6.1 General tells us:

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation.
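
The character-type exception in that paragraph suggests one way to look at the bytes without any risk of hitting a trap representation. The following is only a sketch: it reuses the question's `union xyzzy`, and the helper name `dump_after_char_store` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

union xyzzy
{
    char c;
    int i;
};

/* Hypothetical helper: store to .c, then copy the object representation out
   as unsigned char, a type that has no trap representations.
   out must point to at least sizeof(union xyzzy) bytes. */
void dump_after_char_store(char value, unsigned char *out)
{
    assert(out != NULL);

    union xyzzy plugh;
    plugh.c = value;

    /* Only out[0] ends up with a specified value; the remaining bytes are
       whatever unspecified values the store left behind. */
    memcpy(out, &plugh, sizeof plugh);
}
```

Reading the bytes this way is not undefined, but everything past the first byte is still unspecified, so a program should not depend on it.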

Shafik Yaghmour
  • Note I quote the defect report specifically because footnotes are non-normative, while in a defect report we can see the rationale and have more confidence that we don't have an error, which does happen in non-normative sections from time to time. – Shafik Yaghmour Sep 12 '18 at 16:24

From 6.2.6.1 §7:

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.

So, the value of plugh.i would be unspecified after setting plugh.c.

From a footnote to 6.5.2.3 §3:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.

This says that type punning is specifically allowed (as you asserted in your question). But it might result in a trap representation, in which case reading the value has undefined behavior according to 6.2.6.1 §5:

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation.

If it's not a trap representation, there seems to be nothing in the standard that would make this undefined behavior, because from 4 §3, we get:

A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3.
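
If a fully specified result is what you are after, one option that sidesteps the unspecified bytes altogether is to start from a zeroed `int` and copy the `char` into its first byte. This is only a sketch, assuming `int` has no padding bits (so every bit pattern is a value); `char_into_zeroed_int` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* The sketch only makes sense when int is wider than char. */
static_assert(sizeof(int) > sizeof(char),
              "this sketch assumes int is wider than char");

/* Hypothetical helper: build an int whose first byte is c and whose
   remaining bytes are zero. */
int char_into_zeroed_int(char c)
{
    int i = 0;                  /* every byte of i now has a specified value */
    memcpy(&i, &c, sizeof c);   /* overwrite only the first byte             */
    return i;                   /* numeric result depends on byte order      */
}
```

Unlike the union version, no byte of `i` is left unspecified here, although the numeric value still differs between little-endian and big-endian machines.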

Sander De Dycker

Other answers address the main question of whether reading plugh.i produces undefined behavior when plugh was not initialized and only plugh.c was ever assigned. In short: no, unless the bytes of plugh.i constitute a trap representation at the time of the read.

But I want to speak directly to a preliminary assertion in the question:

Most C compilers nowadays will have sizeof(char) < sizeof(int), with sizeof(int) being either 2 or 4. That means that in these cases at most 50% or 25% of plugh.i will have been written to

The question seems to be supposing that assigning a value to plugh.c will leave undisturbed those bytes of plugh that do not correspond to c, but in no way does the standard support that proposition. In fact, it expressly denies any such guarantee, for as others have noted:

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members *take unspecified values*.

(C2011, 6.2.6.1/7; emphasis added)

Although this does not guarantee that the unspecified values taken by those bytes are different from their values prior to the assignment, it expressly provides that they might be. And it is entirely plausible that in some implementations they often will be. For example, on a platform that supports only word-sized writes to memory or where such writes are more efficient than byte-sized ones, it is likely that assignments to plugh.c are implemented with word-sized writes, without first loading the other bytes of plugh.i so as to preserve their values.
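
To make that last point concrete, here is a toy model (not real compiler output) of the two ways a store to `plugh.c` might be carried out on a hypothetical machine with 32-bit words. The function names are invented for illustration, and the model assumes 8-bit bytes with `c` occupying the low-order byte of the word.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

static_assert(CHAR_BIT == 8, "this model assumes 8-bit bytes");

/* Model of a byte-granular store: only the low byte of the word changes. */
uint32_t store_char_bytewise(uint32_t old_word, unsigned char c)
{
    return (old_word & 0xFFFFFF00u) | c;
}

/* Model of a word-granular store: the register holding c is written out
   whole, so the other three bytes become whatever the register held
   (zero in this model), not what memory held before. */
uint32_t store_char_wordwise(uint32_t old_word, unsigned char c)
{
    (void)old_word;   /* previous contents are neither loaded nor preserved */
    return (uint32_t)c;
}
```

Both models leave the byte belonging to `c` with the stored value, which is all the standard requires; they differ only in the bytes it declares unspecified.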

John Bollinger
  • I think the concept of [wobbly values](https://stackoverflow.com/a/31746063/1708801) encompasses this issue. – Shafik Yaghmour Sep 12 '18 at 16:39
  • @ShafikYaghmour: There are some situations where a guarantee that writing to a struct member will not disturb any storage outside that member would be useful, and working around the lack of such a guarantee would be expensive; there are others where such a guarantee would be expensive. There are likewise platforms where such a guarantee could be upheld at almost no cost, and others where it would be expensive. I don't think the authors of the Standard intended the license above to be applied except in cases where an implementer judged that the latter expense would exceed the former for... – supercat Sep 12 '18 at 17:07
  • ...the kinds of tasks the implementation was intended to serve, but unfortunately the Standard provides no means by which programs which require that guarantee can safely refuse to run on implementations that don't provide it. – supercat Sep 12 '18 at 17:08
  • @ShafikYaghmour, as far as I understand what the term is supposed to mean, I don't think "wobbly values" are needed to understand what I'm describing. That is, whether the unspecified values taken by some bytes of `plugh.i` upon assignment to `plugh.c` are furthermore wobbly is an additional, separate consideration. – John Bollinger Sep 12 '18 at 17:20
  • @JohnBollinger: In some cases, the only way to achieve optimal performance would be to recognize the concept of "wobbly bytes". Suppose, for example, that code writes `unionArray[i].struct1.member2`, `unionArray[j].struct2`, and `unionArray[i].struct1.member1` in that order, and then returns `unionArray[i].struct1`. I think that sequence should have defined behavior if nothing beyond the first member of the returned structure or `unionArray[i].struct1` is ever examined, and see no basis for such examination via character type to invoke UB, but see no basis other than "wobbly values" for... – supercat Sep 12 '18 at 18:21
  • ...the returned structure not to match the final value of `unionArray[i].struct1` unless the return or the observation invokes UB, or there's some kind of "wobbly value" at work. – supercat Sep 12 '18 at 18:22
  • @supercat, I agree that in your scenario there is no justification other than wobbly values (which seem not to have made it into C17), or UB as you describe, for the structure value returned from the function to differ from the final value of `unionArray[i].struct1`. The question I'm considering in this answer, however, is what the standard says about what that value is in the first place, in the event that `i == j`. Without any other details, all I can say with certainty is that its `member1` will have the value set in the third operation. – John Bollinger Sep 12 '18 at 19:01
  • @JohnBollinger: The Standard clearly does not require that unused members take on any specific bit pattern. What is ambiguous is whether they are required to behave as though they hold *some* consistent bit pattern. I think I've come up with a simplified example, which I offer in my answer. – supercat Sep 12 '18 at 23:31

C11 §6.2.6.1 p7 says:

When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.

So, plugh.i would be unspecified.

msc

In cases where useful optimizations might cause some aspects of a program's execution to behave in a fashion inconsistent with the Standard (e.g. two consecutive reads of the same byte yielding inconsistent results), the Standard generally attempts to characterize situations where such effects might be observed, and then classify such situations as invoking Undefined Behavior. It doesn't make much effort to ensure that its characterizations don't "ensnare" some actions whose behavior should obviously be processed predictably, since it expects compiler writers to avoid behaving obtusely in such cases.

Unfortunately, there are some corner cases where this approach really doesn't work well. For example, consider:

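#include <stdint.h>

struct c8 { uint32_t u; unsigned char arr[4]; };
union uc { uint32_t u; struct c8 dat; } uuc1,uuc2;

void wowzo(void)
{
  union uc u;
  u.u = 123;  /* only the bytes of u.u are written; u.dat.arr is left unspecified */
  uuc1 = u;   /* copies the whole union, unspecified bytes included */
  uuc2 = u;
}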

I think it's clear that the Standard does not require that the bytes in uuc1.dat.arr or uuc2.dat.arr contain any particular value, and that a compiler would be allowed to, for each of the four bytes i==0..3, copy uuc1.dat.arr[i] to uuc2.dat.arr[i], copy uuc2.dat.arr[i] to uuc1.dat.arr[i], or write both uuc1.dat.arr[i] and uuc2.dat.arr[i] with matching values. I don't think it's clear whether the Standard intends to require that a compiler select one of those courses of action rather than simply leaving those bytes holding whatever they happen to hold.

Clearly the code is supposed to have fully defined behavior if nothing ever observes the contents of uuc1.dat.arr or uuc2.dat.arr, and there's nothing to suggest that examining those arrays should invoke UB. Further, there is no defined means via which the value of u.dat.arr could change between the assignments to uuc1 and uuc2. That would suggest that uuc1.dat.arr and uuc2.dat.arr should contain matching values. On the other hand, for some kinds of programs, storing obviously-meaningless data into uuc1.dat.arr and/or uuc2.dat.arr would seldom serve any useful purpose. I don't think the authors of the Standard particularly intended to require such stores, but saying that the bytes take on "Unspecified" values makes them necessary. I'd expect such a behavioral guarantee to be deprecated, but I don't know what could replace it.
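
One way to see what a particular implementation actually does is to compare the two arrays through a character type after the call. The sketch below repeats the declarations so it stands alone; `arrays_match` is a hypothetical probe, and whatever it returns is an observation about one compiler and one set of options, not something the Standard promises.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct c8 { uint32_t u; unsigned char arr[4]; };
union uc { uint32_t u; struct c8 dat; } uuc1, uuc2;

/* dat must extend past u, otherwise there are no unspecified bytes to discuss. */
static_assert(sizeof(union uc) > sizeof(uint32_t),
              "dat must extend past u for the question to arise");

void wowzo(void)
{
    union uc u;
    u.u = 123;   /* u.dat.arr is never written */
    uuc1 = u;
    uuc2 = u;
}

/* Hypothetical probe: returns 1 if the unspecified bytes happen to match,
   0 otherwise. memcmp inspects them as unsigned char. */
int arrays_match(void)
{
    return memcmp(uuc1.dat.arr, uuc2.dat.arr, sizeof uuc1.dat.arr) == 0;
}
```

After `wowzo()`, both `uuc1.u` and `uuc2.u` must be 123; the open question is only whether `arrays_match()` is required to return 1.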

supercat