4

When working with char buffers in C, sometimes it would be useful and more efficient to be able to work with int-sized chunks of data at a time. To do this I can cast my char * to an int * and use that pointer instead. However, I'm not entirely confident that this works the way I think it does.

For example, suppose I have char *data. Does *(int32_t *)data = -1 always overwrite the bytes data[0], data[1], data[2] and data[3] and no other bytes?

Shum
  • I think that also violates strict-aliasing. You're allowed to alias any datatype with `char*`, but not the other way around. – Mysticial Jan 30 '12 at 07:52
  • Have you tried it? Did it work as you expected? Also, when setting a buffer to a number it is normally better to use functions such as `memset` as they are already optimized with techniques similar to this. – Some programmer dude Jan 30 '12 at 07:56
  • You should use a `union` instead. Example: `union { int data; char buffer[sizeof(int)]; };`. Also pay attention to endianness and alignment. – jweyrich Jan 30 '12 at 08:03
  • @JoachimPileborg, trying a particular case on a particular system is not a great way of learning about undefined behavior. – Matthew Flaschen Jan 30 '12 at 08:05
  • @jweyrich Surprisingly, type-punning through a union like that also has undefined results and sometimes if you write through `data` then a subsequent read from `buffer` might not give you the right result. That burned me once on a particular compiler. You need to use memcpy instead. (Yes, it's weird -- it always worked fine for us in MSVC but then the code mysteriously blew up when using an esoteric vendor-supplied compiler.) – Crashworks Jan 30 '12 at 08:05
  • @Crashworks: writing to a member of a union, and reading from another is UB. One should only read from the last written member. – jweyrich Jan 30 '12 at 08:07
  • @Crashworks, jweyrich: C99 explicitly allows type-punning through unions - see http://stackoverflow.com/a/8513748/48015 – Christoph Jan 30 '12 at 10:00
  • @Christoph: oh, how did I miss that for so long? TYVM! :) – jweyrich Jan 30 '12 at 19:42
  • @Christoph That makes me even more sad that [HARDWARE MANUFACTURER] still hasn't caught up with their compiler. =( – Crashworks Jan 30 '12 at 22:08

3 Answers

5

Expanding on my comment.

There are two major issues here:


Violating strict-aliasing is technically undefined behavior. You are allowed to alias any datatype with char*, but not the other way around.

You can get around the issue with -fno-strict-aliasing on GCC.


The other issue is alignment. The char pointer might not be properly aligned. Accessing misaligned data may lead to performance degradation or even a hardware exception if the hardware doesn't support misaligned access.


If performance isn't an issue, the foolproof way is to memcpy() to an int array buffer.
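A minimal sketch of the memcpy() approach, applied to the single 32-bit store from the question. The helper names store_i32 and load_i32 are hypothetical, chosen only for illustration. Because memcpy() copies bytes, it imposes no alignment requirement on the char buffer and does not run into the strict-aliasing rule.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value into the first four bytes
 * of buf without casting the char pointer to int32_t *. */
static void store_i32(char *buf, int32_t value)
{
    memcpy(buf, &value, sizeof value);
}

/* Hypothetical helper: read the four bytes back into an int32_t. */
static int32_t load_i32(const char *buf)
{
    int32_t value;
    memcpy(&value, buf, sizeof value);
    return value;
}
```

Calling store_i32(data, -1) touches exactly sizeof(int32_t) bytes starting at data, even when data sits at an odd offset. Many compilers are able to turn a fixed-size memcpy() like this into a single store, although that is an optimization and not something the language promises.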

Once these two issues are resolved, your example with:

*(int32_t *)data = -1

overwriting data[0], data[1], data[2], and data[3] should work as expected if sizeof(int32_t) == 4. Just pay attention to the endianness...
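If the byte order in the buffer matters (for example because the buffer is written to a file or sent over a network), one option is to place each byte explicitly with shifts. This is a sketch under that assumption; the names store_u32_le and load_u32_le are hypothetical, and little-endian is picked here only as an example layout.

```c
#include <stdint.h>

/* Hypothetical helper: write value least significant byte first,
 * independent of the byte order of the machine running the code. */
static void store_u32_le(unsigned char *buf, uint32_t value)
{
    buf[0] = (unsigned char)(value & 0xFFu);
    buf[1] = (unsigned char)((value >> 8) & 0xFFu);
    buf[2] = (unsigned char)((value >> 16) & 0xFFu);
    buf[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Hypothetical helper: reassemble the value from the same layout. */
static uint32_t load_u32_le(const unsigned char *buf)
{
    return (uint32_t)buf[0]
         | ((uint32_t)buf[1] << 8)
         | ((uint32_t)buf[2] << 16)
         | ((uint32_t)buf[3] << 24);
}
```

Since only unsigned char accesses are involved, this version also sidesteps both the aliasing and the alignment concerns, at the cost of four byte stores instead of one word store.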

Mysticial
  • I think you hit the right points, perhaps missing the question of what value is actually put there. @Shum might be expecting all bytes to get the value -1, which will happen in a normal system, for the specific case of -1, but is certainly not guaranteed in general. – ugoren Jan 30 '12 at 08:03
  • @Crashworks, If you can't disable strict-aliasing, then it gets tricky... :( This is a common problem that I actually ignore when vectorizing. (eg. `__m128d*` will alias with `double*`) – Mysticial Jan 30 '12 at 08:11
  • @Crashworks: if performance is an issue, then you might consider non-portable solutions. For example, with GCC you might go ahead with type punning and use the `-fno-strict-aliasing` option (that's what the Linux kernel does). Or - again with GCC - you could choose to use the union 'workaround', which GCC explicitly supports, even when `-fstrict-aliasing` is in force. But, you need to realize that these workarounds are not portable. – Michael Burr Jan 30 '12 at 08:20
3

This is technically undefined behavior and the standard is silent on the results of aliasing pointers like this. A standards pedant would say that invoking undefined behavior in this way could result in anything from corrupted data to a system crash to Ragnarok.

Pragmatically, this depends on your hardware. Most modern systems (e.g. x86, x64, PPC, MIPS, ARM) handle word-sized writes in the way you describe, with the exception that writing to an unaligned address may fault on hardware that does not support unaligned access. Also, this is where endianness comes into play; on a little-endian system:

char foo[4];
*((uint32_t*)(foo)) = 0x01020304;
// the following are now true:
foo[0] == 0x04;
foo[1] == 0x03;
foo[2] == 0x02;
foo[3] == 0x01;

The short answer is that this isn't safe unless you know exactly what hardware your program will run on.
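When the target hardware is not known in advance, the byte order can at least be checked at run time. The following is a sketch only; host_is_little_endian is a hypothetical name, and the check distinguishes little-endian from everything else without trying to identify more exotic layouts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the lowest-addressed byte of a
 * uint32_t holds its least significant byte, 0 otherwise. */
static int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    /* Inspect the first byte in memory without pointer casts. */
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

A check like this says nothing about alignment requirements, so it complements the hardware knowledge mentioned above and does not replace it.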

If you do control the hardware you compile for, then you can predict what the compiler will do; I've used this trick to speed up packing of byte arrays on embedded systems.

Crashworks
1

No, not necessarily. If the data isn't aligned correctly, it might not work at all. Assuming it's aligned correctly, it'll probably overwrite the next sizeof(int) bytes and nothing else, but I'm not sure even that much is entirely guaranteed.

Jerry Coffin
  • *If the data isn't aligned correctly* > Can you please tell me what alignment is and why it is mostly related to types? Does it have to do with endianness? – Mr.Anubis Jan 30 '12 at 08:16
  • Alignment is mostly a hardware artifact. For example, a lot of hardware requires that when you load a 32-bit value that it be aligned to a value that's a multiple of 4 bytes -- i.e., that the 3 least significant bits of the address are all zeros. Other hardware doesn't require that, but still gains performance when it's true. – Jerry Coffin Jan 30 '12 at 17:53
  • Just a minor correction: alignment to 4 bytes means the 2 least significant bits are zero. 3 bits would indicate unsigned values from 0 to 7, right? =) – t0rakka Nov 16 '16 at 00:06
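The low-bits description of alignment in these comments can be sketched as a small check. The name is_aligned is hypothetical, and the code assumes that uintptr_t is available and that converting a pointer to it yields the numeric address, which is implementation-defined but holds on common platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of alignment.
 * alignment must be a power of two, so (alignment - 1) is a mask of
 * the low bits that have to be zero (for 4 bytes: the low 2 bits). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

With this, a caller could test is_aligned(data, sizeof(int32_t)) before deciding between a word-sized access and a byte-wise fallback. Passing the alignment check does not by itself make the cast-and-dereference valid under the strict-aliasing rule.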