
I'm currently working on a project to build a small compiler just for the heck of it.

I've decided to build an extremely simple virtual machine to target, so I don't have to worry about learning the ins and outs of ELF, Intel assembly, etc.

My question is about type punning in C using unions. I've decided to only support 32-bit integers and 32-bit float values in the VM's memory. To facilitate this, the "main memory" of the VM is set up like this:

typedef union
{    
    int i;
    float f;
}word;


memory = (word *)malloc(mem_size * sizeof(word));

So I can then just treat the memory section as either an int or a float depending on the instruction.

Is this technically type punning? It certainly would be if I were to use ints as the words of memory and then use a float* to treat them like floats. I don't think my current approach, while syntactically different, is semantically different. In the end I'm still treating 32 bits in memory as either an int or a float.

The only information I could come up with online suggests that this is implementation dependent. Is there a more portable way to achieve this without wasting a bunch of space?

I could do the following, but then I would be taking up more than 2 times as much memory and "reinventing the wheel" with respect to unions.

typedef struct
{
    int i;
    float f;
    char is_int;
} word;

Edit

I perhaps didn't make my exact question clear. I am aware that I can use either a float or an int from a union without undefined behavior. What I'm after is specifically a way to have a 32-bit memory location that I can safely use as an int or a float without knowing which type was last stored. I want to account for the situation where the other type is used.

Peter Cordes
David Mason
  • 4
    ...don't cast the return value of `malloc` in C... – Ed S. Jul 11 '12 at 22:52
  • 1
    @EdS. I'll point out that just because whitespace is ignored and unnecessary doesn't mean you shouldn't indent. There is nothing wrong with explicitly casting it, even if it will cast implicitly if you leave the cast out – Wug Jul 11 '12 at 22:59
  • 11
    @Wug: And I'll point out that you have proposed a ridiculous analogy. There is in fact something wrong with explicitly casting it. First, it is redundant and not idiomatic C. Secondly it may very well hide the fact that you forgot to include `<stdlib.h>`, in which case `malloc` will be assumed to be a function which returns int, and the cast hides the error. Even if that weren't the case, are you in favor of writing redundant and unnecessary code? C is not C++. – Ed S. Jul 11 '12 at 23:02
  • Yeah, I hate writing redundant and unnecessary spaces. I'll just leave everything zero indented from now on. (My point is that this is a style issue. Syntactically, both ways are correct.) – Wug Jul 11 '12 at 23:06
  • 7
    @Wug: Did you actually *read* what I wrote? It's not simply a style issue, and comparing indentation to redundant code is, I say again, a ridiculous comparison. White space is not useless, the cast is, and that cast can hide an error. I see from your profile that you know C++. That's great, but C is not C++, so if you're going to comment on C code I would do a little homework first. – Ed S. Jul 11 '12 at 23:07
  • If you forget to include the standard library you're going to run into a lot of other problems. Redundant is not an error. "Not idiomatic C" is not an error. I also took the liberty of fact checking myself before I posted even my first comment, and while I found reasons it's unnecessary, I didn't see anything saying outright "ITS WRONG." – Wug Jul 11 '12 at 23:09
  • Also, if you'll google the phrase "c malloc example", every example casts the return value. – Wug Jul 11 '12 at 23:11
  • 1
    @Wug No, the very first result for me is the wikipedia example which does it right. – Daniel Fischer Jul 11 '12 at 23:15
  • 1
    The wikipedia page is so kind as to do it both ways, and to include a section explaining the pros and cons of casting it. – Wug Jul 11 '12 at 23:19
  • What I said, @Wug, it does it right. – Daniel Fischer Jul 11 '12 at 23:20
  • A link to that wikipedia page rather than a bun fight might have been preferable a few comments ago: http://en.wikipedia.org/wiki/C_dynamic_memory_allocation – ChrisH Jul 11 '12 at 23:34
  • 7
    @Wug: Casting the return value of `malloc` is an archaism from the era when `malloc` returned `char *`. If the examples you see on Google cast the result, they were either written by dinosaurs or they are written in the context of C++. In C casting the result of `malloc` is a serious blunder. – AnT stands with Russia Jul 11 '12 at 23:56
  • 1
    @Wug: Ed is right. See the answers to this question: http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc. – Oliver Charlesworth Jul 11 '12 at 23:59

2 Answers

18

Yes, storing one member of a union and reading another is type punning (assuming the types are sufficiently different). Moreover, this is the only kind of universal (any type to any type) type punning that is officially supported by the C language. It is supported in the sense that the language promises that in this case the type punning will actually occur, i.e. that a physical attempt to read an object of one type as an object of another type will take place. Among other things, it means that writing one member of the union and reading another member implies a data dependency between the write and the read. This, however, still leaves you with the burden of ensuring that the type punning does not produce a trap representation.

When you use cast pointers for type punning (what is usually understood as "classic" type punning), the language explicitly states that in the general case the behavior is undefined (aside from reinterpreting an object's value as an array of chars and other restricted cases). Compilers like GCC implement so-called "strict aliasing semantics", which basically means that pointer-based type punning might not work as you expect. For example, the compiler might (and will) ignore the data dependency between type-punned reads and writes and rearrange them arbitrarily, thus completely ruining your intent. This

int i;
float f;

i = 5;
f = *(float *) &i;

can easily be rearranged into what is effectively

f = *(float *) &i;
i = 5;

specifically because a compiler that applies strict aliasing deliberately ignores the possibility of a data dependency between the write and the read in the example.

In a modern C compiler, when you really need to perform a physical reinterpretation of one object's value as a value of another type, you are restricted to either memcpy-ing bytes from one object to another or to union-based type punning. There are no other ways. Casting pointers is no longer a viable option.
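As a rough illustration of the memcpy route, here is a minimal sketch. The helper names `float_to_bits` and `bits_to_float` are made up for this example, and it assumes a C11 compiler and a 32-bit `float` (checked at compile time).

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* uint32_t */
#include <string.h>   /* memcpy */

/* Assumption of this sketch: float occupies exactly 32 bits. */
static_assert(sizeof(float) == sizeof(uint32_t),
              "this sketch assumes a 32-bit float");

/* Copy the object representation of a float into an integer. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Copy an integer's bytes back into a float. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the bytes are copied rather than accessed through a pointer of the wrong type, the aliasing rules are not involved. Optimizing compilers typically turn a fixed-size `memcpy` like this into a plain register move, though that is a quality-of-implementation matter rather than a guarantee.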

AnT stands with Russia
  • I'm probably splitting hairs, but the GCC manual states the pun occurs based on the previous write access and the current read access. If I am parsing the manual correctly, you can perform multiple writes to the union and avoid punning during reads. (I'm not sure if that's feasible in practice, though). See the discussion of `-fstrict-aliasing` in [3.10 Options That Control Optimization](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html). – jww Aug 02 '15 at 11:22
  • 1
    `memcpy` is the other officially-safe way of type-punning in ISO C. I guess you're making a distinction from `union` because memcpy makes a copy, not reading the original bytes as a different type. – Peter Cordes Dec 05 '22 at 15:45
8

As long as you only access the member (int or float) which was most recently stored, there's no problem and no real implementation dependency. It's perfectly safe and well-defined to store a value in a union member and then read that same member.

(Note that there's no guarantee that int and float are the same size, though they are on every system I've seen.)

If you store a value in one member and then read the other, that's type punning. Quoting a footnote in the latest C11 draft:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
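To make that concrete for the question's `word` type, here is a minimal sketch, not a definitive implementation. It swaps `int` for `int32_t` so the integer member has a guaranteed width, and it assumes a C11 compiler and a 32-bit `float`; the helper names `word_float_as_int` and `word_int_as_float` are hypothetical.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>   /* int32_t */

typedef union {
    int32_t i;
    float   f;
} word;

/* Assumption of this sketch: both members are exactly 32 bits wide. */
static_assert(sizeof(float) == sizeof(int32_t),
              "word assumes a 32-bit float");

/* Store through the float member, read through the int member. */
int32_t word_float_as_int(float value)
{
    word w;
    w.f = value;
    return w.i;   /* bytes of the float reinterpreted as int32_t */
}

/* Store through the int member, read through the float member. */
float word_int_as_float(int32_t value)
{
    word w;
    w.i = value;
    return w.f;   /* bytes of the int32_t reinterpreted as float */
}
```

Since `int32_t` has no padding bits, every bit pattern is a valid value when read through `i`. Reading through `f` can still yield a NaN or some other unusual value, depending on the platform's floating-point format.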

Keith Thompson
  • 2
    On the (few) 64-bit systems I have experience with, `int` and `float` are both 32 bits wide. What systems have different sizes for them? – Daniel Fischer Jul 11 '12 at 23:01
  • I have not yet seen a platform upon which `int` was anything but a 32 bit quantity. Single precision floats are also defined to be 32 bits. That said, the only requirement for "int" is "at least 16 bits". Might be a good idea to have an assertion for sizeof(int) == sizeof(float) though, since ints may change sizes – Wug Jul 11 '12 at 23:04
  • The way I have it set up right now, I have no guarantee that they won't be accessed. I suppose I could just transfer the illegal/undefined behavior to the target of my vm, though. I just have an `add` and a `addf` instruction, and they will treat the memory values accordingly. So, there is nothing stopping the assembly code from loading up ints and adding them as floats. – David Mason Jul 11 '12 at 23:08
  • 1
    @DanielFischer: My mistake; *most* 64-bit systems have 32-bit int. The ones I've seen with 64-bit int have been Cray systems. – Keith Thompson Jul 11 '12 at 23:20
  • @Wug: My mistake; see my previous comment. (But I *have* seen systems with 64-bit `float`.) – Keith Thompson Jul 11 '12 at 23:21
  • How weird. I've only ever seen a 32 bit float and a 64 bit double. – Wug Jul 11 '12 at 23:22
  • 1
    I recommend using types like `int32_t` that have a guaranteed size on every platform instead of `int` for things like this. It will eliminate a variable from the equation. – bta Jul 11 '12 at 23:25
  • 1
    @Wug: The systems were Cray vector machines, which use(d) their own (non-IEEE) floating-point format. The word size is 64 bits; accessing smaller chunks of data is inefficient. – Keith Thompson Jul 11 '12 at 23:25
  • 2
    @bta: Alas, there's no `float32_t` - but an assertion that `sizeof (int32_t) == sizeof (float)` would be ok for the vast majority of systems. – Keith Thompson Jul 11 '12 at 23:26