Does accessing union members via a pointer, as in the example below, result in undefined behavior in C99? The intent seems clear enough, but I know that there are some restrictions regarding aliasing and unions.

#include <stdio.h>

union { int i; char c; } u;

int main(void)
{
    int  *ip = &u.i;
    char *ic = &u.c;

    *ip = 0;
    *ic = 'a';
    printf("%c\n", u.c);
    return 0;
}
Dara Hazeghi

3 Answers

It is unspecified (subtly different from undefined) behaviour to access a union by any element other than the one that was last written. That's detailed in C99 annex J:

The following are unspecified:
   :
   The value of a union member other than the last one stored into (6.2.6.1).

However, since you are writing to c via the pointer, then reading c, this particular example is well defined. It does not matter how you write to the element:

u.c = 'a';        // direct write.
*(&(u.c)) = 'a';  // variation on yours, writing through element pointer.
(&u)->c = 'a';    // writing through structure pointer.
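As a minimal sketch of that claim (the union tag `cu` and the function name `all_writes_read_back` are made up for illustration, they are not from the question), each of the three write forms is followed by a read of the same member:

```c
/* Hypothetical names: union tag "cu" and this function are illustrative only. */
union cu { int i; char c; };

/* Writes member c in three different ways and reads it back after each write.
   Returns 1 if every read sees the value that was just stored. */
int all_writes_read_back(void)
{
    union cu u;
    char *cp = &u.c;   /* pointer to the member, as in the question */
    int ok = 1;

    u.i = 0;                         /* some other member was stored first */

    u.c = 'a';                       /* direct write */
    ok &= (u.c == 'a');

    *cp = 'b';                       /* write through the member pointer */
    ok &= (u.c == 'b');

    (&u)->c = 'c';                   /* write through a pointer to the union */
    ok &= (u.c == 'c');

    return ok;
}
```

In every case the member read is the member last written, which is the situation this answer argues is well defined.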

One issue raised in the comments appears to contradict that. User davmac provides sample code:

// Compile with "-O3 -std=c99" eg:
//  clang -O3 -std=c99 test.c
//  gcc -O3 -std=c99 test.c
// On clang v3.5.1, output is "123"
// On gcc 4.8.4, output is "1073741824"
//
// Different outputs, so either:
// * program invokes undefined behaviour; both compilers are correct OR
// * compiler vendors interpret standard differently OR
// * one compiler or the other has a bug

#include <stdio.h>

union u
{
    int i;
    float f;
};

int someFunc(union u * up, float *fp)
{
    up->i = 123;
    *fp = 2.0;     // does this set the union member?
    return up->i;  // then this should not return 123!
}

int main(int argc, char **argv)
{
    union u uobj;
    printf("%d\n", someFunc(&uobj, &uobj.f));
    return 0;
}

This outputs different values on different compilers. However, I believe that is because the code actually violates the rules: it writes to member f and then reads member i and, as shown in Annex J, that's unspecified.

There is a footnote 82 in 6.5.2.3 which states:

If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type.

However, since this seems to go against the Annex J comment and it's a footnote to the section dealing with expressions of the form x.y, it may not apply to accesses via a pointer.

One of the major reasons the aliasing rules are strict is to give the compiler more scope for optimisation. To that end, the standard leaves the result unspecified when memory is read as a type different from the one it was written with.

By way of example, consider the function provided:

int someFunc(union u * up, float *fp)
{
    up->i = 123;
    *fp = 2.0;     // does this set the union member?
    return up->i;  // then this should not return 123!
}

The implementation is free to assume that, because you're not supposed to alias memory, up->i and *fp are two distinct objects. It can therefore assume that the value of up->i does not change after it is set to 123, and simply return 123 without looking at the actual variable contents again.

If, instead, you changed the statement that writes through the pointer to:

up->f = 2.0;

then that would make footnote 82 applicable and the returned value would be a re-interpretation of the float as an integer.
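A rough sketch of that variant follows. The name `someFuncViaUnion` is hypothetical, and the value mentioned afterwards assumes a 32-bit int and a 32-bit IEEE 754 float, which the standard does not guarantee:

```c
union u
{
    int i;
    float f;
};

/* Hypothetical variant of someFunc: both accesses go through the union type,
   so the compiler can see that the store to f overlaps member i. */
int someFuncViaUnion(union u *up)
{
    up->i = 123;
    up->f = 2.0f;    /* stored via the union, not via a separate float pointer */
    return up->i;    /* the bytes of 2.0f reinterpreted as an int (footnote 82) */
}
```

Under those assumptions the bit pattern of 2.0f is 0x40000000, so the function returns 1073741824, the same number gcc printed for davmac's program.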

The reason I don't think that's an issue for the question is that you're writing and then reading the same type, so the aliasing rules don't come into play.


It's interesting to note that the unspecified behaviour is caused not by the function itself, but by calling it thus:

union u up;
int x = someFunc (&up, &(up.f)); // <- aliasing here

If you were instead to call it so:

union u up;
float down;
int x = someFunc (&up, &down); // <- no aliasing

that would not be a problem.
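Put together as something compilable (the wrapper name `callWithoutAliasing` is made up for this sketch), the non-aliasing call looks like this:

```c
union u
{
    int i;
    float f;
};

int someFunc(union u *up, float *fp)
{
    up->i = 123;
    *fp = 2.0f;      /* here fp points at a separate float, not at the union */
    return up->i;
}

/* Hypothetical wrapper: passes a float that is unrelated to the union. */
int callWithoutAliasing(void)
{
    union u up;
    float down;
    return someFunc(&up, &down);
}
```

Since `down` does not overlap `up`, the store through `fp` cannot affect `up->i`, and the function returns 123 whether or not the compiler re-reads the member.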

paxdiablo
  • I don't believe this is correct, because the line `*ic = 'a';` in the OPs question does not actually write to the 'c' member of the union. A union can only contain one member at a time (C99 footnote 37 from 6.2.5p22); since the active member has not been set (or has arguably been set to `i` by the previous line) then `ic` does not refer to an object, and de-referencing pointers is only defined if they point at an object (6.5.3.2p4). AFAIK this is the consensus understanding amongst compiler implementors. – davmac Apr 24 '15 at 14:34
  • @davmac, it _is_ writing to `u.c` via the `ic` pointer. You cannot say it's arguably been set to `i` by the previous write but not set to `c` by the next. The act of writing to `c` marks `c` as the active member. The standard states you can't read from a non-active member, but you can change the active member at any point just by writing to it. What the question has is no different to `u.i = 0; u.c = 'a';`. – paxdiablo Apr 24 '15 at 15:59
  • In other words, writing to a member (directly or through a valid pointer) _makes_ it the active member. – paxdiablo Apr 24 '15 at 16:01
  • "In other words, writing to a member (directly or through a valid pointer) makes it the active member" - can you justify this with reference to the standard? In particular why is it that reading a member via the union object should have different effect than reading it via a member pointer, but writing it via the union object or member pointer has the same effect? The GCC docs for "-fstrict-aliasing" - https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options - say "Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type" – davmac Apr 25 '15 at 08:52
  • Example program: http://pastebin.com/6jFwHx9C According to your interpretation here, the program is well-defined, but it gives different output from two different compilers (gcc 4.8.4 and LLVM's clang 3.5.1). – davmac Apr 25 '15 at 10:42
  • btw example relies on footnote 82 - "If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning")" (introduced in TC3). You _could_ argue that the requirement in footnote 82 applies only when both accesses are through the union object and the "active member" if set via pointer has no bearing. I'll upvote if you amend answer as such. – davmac Apr 25 '15 at 12:06
  • @davmac, your code is flawed. You write the float then read the integer and that _is_ undefined behaviour. You say it shouldn't be 123 but in reality it could be anything, including 123. – paxdiablo Apr 25 '15 at 12:56
  • @paxdiablo you said that storing a union member via a pointer is permissible. My code does this, and then reads from a different union member (but the same union object). I have pointed out that footnote 82 legitimises this form of type punning and specifies the behavior (although certainly the resulting value is implementation dependent). Kindly explain the flaw. (And by the way, I think you meant _unspecified_ behavior - it is certainly not undefined). – davmac Apr 25 '15 at 14:32
  • (To clarify, if the store to the union member via the pointer is legitimate as you claim, and the type punning is legitimate, then the value returned from the function should be the float value 2.0 re-interpreted as an int. That is the case with GCC, but I think it's only a coincidence. LLVM/clang OTOH clearly believes that the write via `fp` cannot alias the union member `f`). – davmac Apr 25 '15 at 15:16
  • @davmac, okay, I think I see where you're coming from now - standards docs are second only to patents in terms of difficulty in understanding :-) I _did_ mean unspecified and I believe you're correct in the limitations of footnote 82. Since it's a footnote to the para that states how to interpret `x.y`, I suspect it doesn't cover a pointer to `x.y`. I'll update the answer with some more "meat". – paxdiablo Apr 26 '15 at 05:19
  • @paxdiablo Thanks, and you're right about the complexity of the standards and C99 in particular is nasty. (Even this case is worse than it first appears - footnote 82, being a footnote, is not meant to be prescriptive or "normative" as the ISO people call it - that is, you are supposed to be able to determine that same behavior without reading the footnote. To me this is not evident at all). – davmac Apr 26 '15 at 09:43
  • Accessing a union using a member other than the last one written to wasn't supposed to be unspecified behavior, that was just a defect in the standard: http://stackoverflow.com/a/8513748/371250 http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm – ninjalj Dec 10 '15 at 23:00
  • @ninjalj: The effective type rules only "work" if reading a member other than the last one written yields, at best, unspecified results. Otherwise the rules would need to provide that the effective type of a union can change depending upon what's written to it, and they don't. – supercat Dec 10 '15 at 23:31

No, it won't, but you need to keep track of the last type you stored in the union. If I were to reverse the order of your int and char assignments it would be a very different story:

#include <stdio.h>

union { int i; char c; } u;

int main()
{
    int  *ip = &u.i;
    char *ic = &u.c;

    *ic = 'a';
    *ip = 123456;

    printf("%c\n", u.c); /* trying to print a char even though 
                            it's currently storing an int,
                            in this case it prints '@' on my machine */

    return 0;
}

EDIT: Some explanation on why it may have printed 64 ('@').

The binary representation of 123456 is 0001 1110 0010 0100 0000.

For 64 it is 0100 0000.

You can see that the lowest 8 bits are identical. Since u.c reads only the first byte of the union, and on a little-endian machine that byte is the least significant byte of the int, only that much is printed.
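The arithmetic can be checked with a small sketch (the helper name `low_byte` is made up here). It only computes the least significant byte of the value; which byte of the union `u.c` actually overlays still depends on the machine's byte order:

```c
/* Hypothetical helper: returns the least significant 8 bits of value. */
unsigned low_byte(unsigned long value)
{
    return (unsigned)(value & 0xFFu);   /* mask off everything above bit 7 */
}
```

`low_byte(123456)` is 64, because 123456 is 0x1E240 and its low byte is 0x40, which is '@' in ASCII.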

Nobilis

The only reason it's not UB is because you were lucky/unlucky enough to choose char for one of the types, and character types can alias anything in C. If the types were, for example, int and float, the accesses via pointers would be aliasing violations and thus undefined behavior. For direct access via the union, the behavior was deemed well defined as part of the interpretation for Defect Report 283:

http://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm

Of course, you still need to ensure that the representation of the type used for writing can also be interpreted as a valid (non-trap) representation for the type later used for reading.
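To illustrate the character-type exception mentioned above, here is a minimal sketch (the function name `count_nonzero_bytes` is hypothetical) that inspects the bytes of any object through an `unsigned char` pointer, which the aliasing rules permit:

```c
#include <stddef.h>

/* Hypothetical helper: walks the object representation of *obj one byte at a
   time. Accessing any object through unsigned char is allowed by the aliasing
   rules, unlike reading an int through a float pointer. */
size_t count_nonzero_bytes(const void *obj, size_t size)
{
    const unsigned char *bytes = obj;
    size_t count = 0;

    for (size_t k = 0; k < size; k++) {
        if (bytes[k] != 0) {
            count++;
        }
    }
    return count;
}
```

For an `int` holding 1, this returns 1 on mainstream implementations (assuming no padding bits in `int`), regardless of byte order.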

R.. GitHub STOP HELPING ICE
  • There is no aliasing in the code shown; it accesses the last member stored in the union. – Eric Postpischil May 29 '13 at 13:17
  • There is aliasing. The two writes via pointers could be reordered if not for the fact that one of them is of character type. – R.. GitHub STOP HELPING ICE May 29 '13 at 13:34
  • That needs to be stated and explained in the answer and supported with citations from the standard. – Eric Postpischil May 29 '13 at 13:45
  • R.., can you actually explain where aliasing comes into play here? The writes to both members are via pointers of the correct type, `int*` for `i` and `char*` for `c`. I understand there would be a problem if you tried to write `i` via a `float*` for example but if `int c` were actually `float f` and you wrote via a `float*`, it would still be valid as far as I know. Or am I missing something? – paxdiablo May 30 '13 at 03:47
  • @paxdiablo: Actually, I'm unclear now on whether the *writes* are forbidden. I think this is an issue that could use more research. – R.. GitHub STOP HELPING ICE May 30 '13 at 04:06
  • @R.: Writes count as access, per C 2011 (n1570) 3.1, which defines “access” to mean “to read or modify the value of an object.” I certainly see your point; the compiler may in some circumstances treat an `int *i` and a `float *f` as pointing to different objects, so `*i = 1; *f = 2;` is unordered due to the “as if” rule. But I am having trouble pinning this down in the standard. 6.5 7 sets rules for accessing an object. But pointers to union members seem to obey the rules; each accesses a member via a type compatible with the effective type of the member. This implies a compiler must support… – Eric Postpischil May 30 '13 at 13:26
  • … `*i = 1; *f = 2; printf(…, u.f)`, given that `i` and `f` point to the `int i` and `float f` members in `u`. But that seems to be an excessive constraint on the compiler, as it prevents the compiler from optimizing using the property that function parameters pointing to different types point to different memory. So I would be interested in seeing this resolved in the standard. – Eric Postpischil May 30 '13 at 13:29
  • @EricPostpischil: Read about effective types. Not all objects have a declared type (e.g. objects obtained by `malloc`), so the effective type might not be established until the point of a write. Though it's not clear (or at least I can't find anywhere it's clarified), I think the same principle probably applies with pointers to union members. I'm still not clear on whether this precludes reordering of writes, however, and whether OP's code with `s/char/float/` would have UB. – R.. GitHub STOP HELPING ICE May 30 '13 at 14:17
  • @EricPostpischil, I know this thread is old but: I believe consensus among compiler vendors is that changing the active member of a union requires writing via the union object, so you generally can't store to a non-active member via a pointer (if you do so, you're aliasing: the object that you access is the active member, rather than the one with the type you used for access). This consensus is not however directly supported by the wording in the standard. I hope that clears it up. – davmac Apr 25 '15 at 11:32