6

I read that this code is undefined according to the C standard, but I can't find why. It compiles without errors in gcc 8.1.0 and clang-6.0 and prints 1.

The code is as follows:

#include <stdio.h>

int main()
{
   union {
     int i;
     short s;
   } u;  
   u.i = 42;
   u.s = 1;

   printf("%d\n", u.i);
   return 0;
}
dbush
theriver
  • 2
    @tadman It is not undefined behavior. It is specifically allowed as of C99, and a footnote added in C11 clarifying the legality of it. – Christian Gibbons Oct 03 '18 at 17:29
  • 1
    @ChristianGibbons If you have a citation that'd help. I can't see how you could define what would happen to `i` here. – tadman Oct 03 '18 at 17:30
  • @tadman See Some programmer dude's answer. He got to it faster than me. – Christian Gibbons Oct 03 '18 at 17:33
  • @ChristianGibbons Thanks for the note. Pulled my answer. – tadman Oct 03 '18 at 17:52
  • Based on the wording of 6.2.6.1/5 and footnote 95, it's undefined *unless* one of the members is of character type, which I interpret to mean either a single character or an array. There's no such thing as a trap representation in `char` or `unsigned char`, so you should be able to safely write to a member of type `T` and read from a member of type `char` or `char [N]`. For any other types, all bets are off. – John Bode Oct 03 '18 at 17:59
  • @ChristianGibbons: If some part(s) of the Standard defines the behavior of some action, but another part says it's undefined, the latter takes precedence (though compilers would be free to extend the language by giving the former parts precedence in cases where that would be useful). The way N1570 6.5p7 is written, any attempt to access a union object via means other than an lvalue of union type or character type invokes UB. Since 6.5p7 includes no provision for accessing unions via lvalues of member type, support for such accesses is not part of the Standard, but merely a popular extension. – supercat Oct 03 '18 at 22:03

3 Answers

8

From the C11 specification, §6.5.2.3 note 95:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.

This says that what you're doing is allowed, but also implies that the value you read may not be what you expect (for example by writing to an int member and reading from a float member).

There's also the caveat about trap representations: if the reinterpreted value is one, the behavior is undefined. For two's complement systems (the vast majority of computers in the last couple of decades) this isn't an issue with integer values, though.


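In your case the result depends very much on the platform's endianness and on the sizes of int and short. With a 32-bit int and a 16-bit short, a little-endian system typically gives you the value you wrote (1), while a big-endian system gives 65578 (0x0001002a), because there the short overlays the two most significant bytes of the int.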

Some programmer dude
  • 2
    I could imagine an exotic one's-complement system with "negative zero" being a trap. – Christian Gibbons Oct 03 '18 at 17:35
  • 1
    @ChristianGibbons Oh those pesky esoteric platforms always throw a wrench in everything... :) – Some programmer dude Oct 03 '18 at 17:40
  • 1
    Some platforms just want to watch the world burn. – Christian Gibbons Oct 03 '18 at 17:41
  • 1
    one’s-complement makes multiplication and division a bit cheaper, if there’s no hardware support for those, and also makes long mul/div cheaper if it’s implemented in software. – Kuba hasn't forgotten Monica Oct 03 '18 at 17:44
  • in a powerpc prints 65578 – theriver Oct 03 '18 at 17:54
  • @theriver While PowerPC has configurable endianness (IIRC), most are using big-endian. That means your `1` would be stored as `0x01 0x00` in the 16-bit `short` value, and in the 32-bit `int` it would be `0x01 0x00 xx xx` (with `xx` most likely being `0` due to the previous assignment). That means the value of `i` would be `1`. If you get something else then are you sure you're showing us the actual code you're running? With the actual values? – Some programmer dude Oct 03 '18 at 18:58
  • @Someprogrammerdude my power pc is big endian and the value in memory is 0x00 0x01 0x00 0x2a – theriver Oct 04 '18 at 15:31
  • Ah yeah damn I mixed it up! I'm sorry, you're correct and the values you're telling are also correct. The assignment `u.i = 42` will set all the memory to `0x00 0x00 0x00 0x2a`. Then `u.s = 1` will set the first two bytes to `0x00 0x01` leaving you with exactly the results you're seeing. – Some programmer dude Oct 04 '18 at 16:03
  • @Someprogrammerdude In little endian u.i=42 is **0x2a 0x00 0x00 0x00** and u.s=1 gives **0x01 0x00 0x00 0x00**; why does 0x2a disappear from memory? – theriver Oct 04 '18 at 16:22
  • @theriver Remember that in a union all members *share* the memory. The `short` member `s` is overlying the `int` member `i` in memory. – Some programmer dude Oct 04 '18 at 16:24
  • @Someprogrammerdude yes, but in big endian 0x2a stays in memory and in little endian 0x2a disappears, why? – theriver Oct 04 '18 at 16:29
  • @theriver Again, remember that the memory is *shared* and that the member *overlays* each other in memory. Writing to `i` will modify all four bytes allocated for the union. Writing to `s` will only modify the two bytes needed for `s`, leaving the other two bytes untouched. For both little- and big-endian the bytes being untouched is the last two bytes, while the first two will be overwritten. – Some programmer dude Oct 04 '18 at 16:35
  • @theriver Try this: On a piece of paper draw four squares next to each other with a pencil. Then in the first square write `0x2a`, in the second to fourth write `0x00`. Now, erase the numbers in the first two squares, and write `0x01` and `0x00`. That's basically what happens here on a little-endian system. Is the value `0x2a` still there? No, it's been erased and overwritten by `0x01`. – Some programmer dude Oct 04 '18 at 16:38
  • @Someprogrammerdude okay, but in big endian 0x2a is not erased and overwritten by 0x01, unlike in little endian; why? Memory in big endian: **0x00 0x01 0x00 0x2a** – theriver Oct 04 '18 at 16:52
  • @theriver Draw that out on paper as well. Notice that then the `0x2a` is in the *last* square? While you still erase the two *first*? – Some programmer dude Oct 04 '18 at 19:46
  • I finally understood the difference, thank you for explaining to me @Someprogrammerdude – theriver Oct 04 '18 at 20:32
3

union {
    int i;
    short s;
} u;
u.i = 42;
u.s = 1;

What happens when you assign a value to u.i that's larger than a short can hold? For example, try this:

u.i = 40000;
u.s = 1;

Should the compiler clear out the entire space reserved for u before assigning the short, or should it just write the bytes needed to store the new value? Since it's your responsibility to keep track of how to interpret the value stored in u, storing one type and then reading a different type of a different size seems like a poor plan.

Caleb
2

Writing to one member of a union and reading from another is referred to as type punning and is allowed by the standard.

This is spelled out in section 6.5.2.3:

3 A postfix expression followed by the . operator and an identifier designates a member of a structure or union object. The value is that of the named member, 95) and is an lvalue if the first expression is an lvalue. If the first expression has qualified type, the result has the so-qualified version of the type of the designated member.

95) If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.

dbush
  • The expression `u.i` is an lvalue of type `int`. According to N1570 6.5p7, union object `u` can only be accessed with an lvalue of `u`'s type or a character type; `u.i` is neither. So far as I can tell, the ability to use non-character-type lvalue like `u.i` to access `u` isn't part of the Standard, but merely a (very) popular extension. – supercat Oct 03 '18 at 22:08