9

A char is 1 byte and an integer is 4 bytes. I want to copy byte-by-byte from a char[4] into an integer. I thought of different methods but I'm getting different answers.

char str[4]="abc";
unsigned int a = *(unsigned int*)str;
unsigned int b = str[0]<<24 | str[1]<<16 | str[2]<<8 | str[3];
unsigned int c;
memcpy(&c, str, 4);
printf("%u %u %u\n", a, b, c);

Output is 6513249 1633837824 6513249

Which one is correct? What is going wrong?

avmohan

6 Answers

15

It's an endianness issue. When you reinterpret the char* as an unsigned int*, the first byte of the string becomes the least significant byte of the integer (because you ran this code on x86, which is little-endian), while with the manual conversion the first byte becomes the most significant.

To put this into pictures, this is the source array:

   a      b      c      \0
+------+------+------+------+
| 0x61 | 0x62 | 0x63 | 0x00 |  <---- bytes in memory
+------+------+------+------+

When these bytes are interpreted as an integer in a little endian architecture the result is 0x00636261, which is decimal 6513249. On the other hand, placing each byte manually yields 0x61626300 -- decimal 1633837824.
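As a rough illustration of the picture above, the sketch below reads four bytes in the host's native order and checks which order the host uses. The helper names load_native32 and host_is_little_endian are made up for this example, and the check deliberately ignores exotic mixed-endian machines.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret 4 bytes in the host's native byte order.
   memcpy avoids the alignment and aliasing problems of a pointer cast. */
static uint32_t load_native32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise.
   On a little-endian host the byte at the lowest address is the least
   significant one, so the bytes 01 00 00 00 read back as the value 1. */
static int host_is_little_endian(void)
{
    const unsigned char probe[4] = { 0x01, 0x00, 0x00, 0x00 };
    return load_native32(probe) == 1u;
}
```

On a little-endian host, load_native32 applied to the bytes 0x61 0x62 0x63 0x00 should give 0x00636261; on a big-endian host it should give 0x61626300.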

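Of course, treating a char* as an int* is undefined behavior (it breaks the aliasing rules, and str may not be suitably aligned), so the difference is not important in practice: you are not really allowed to use the first conversion, even if it happens to work on your machine. There is, however, a way to achieve the same result in C, which is called type punning through a union (note that C++ does not permit this):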

union {
    char str[4];
    unsigned int ui;
} u;

strcpy(u.str, "abc");
printf("%u\n", u.ui);
Jon
  • Thanks. The picture makes it very clear. The answer I wanted was the one with bytes placed manually. BTW, You made a typo- 0x64 in array picture instead of 0x63. – avmohan Oct 11 '13 at 17:39
6

Neither of the first two is correct.

The first violates aliasing rules and may fail because the address of str is not properly aligned for an unsigned int. To reinterpret the bytes of a string as an unsigned int with the host system byte order, you may copy it with memcpy:

unsigned int a; memcpy(&a, &str, sizeof a);

(Presuming the size of an unsigned int and the size of str are the same.)

The second may fail with integer overflow because str[0] is promoted to an int, so str[0]<<24 has type int, but the value required by the shift may be larger than is representable in an int. To remedy this, use:

unsigned int b = (unsigned int) str[0] << 24 | …;

This second method interprets the bytes from str in big-endian order, regardless of the order of bytes in an unsigned int in the host system.
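A minimal sketch of the shift-based approach with the casts written out in full, assuming the input is exactly four bytes. The names load_be32 and load_le32 are illustrative only. Taking the bytes as unsigned char also sidesteps the case where plain char is signed and a byte value above 127 would otherwise turn negative.

```c
#include <stdint.h>

/* Hypothetical helper: p[0] is the MOST significant byte (big-endian).
   Each byte is converted to uint32_t before shifting, so the shift
   never overflows a signed int. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Hypothetical helper: p[0] is the LEAST significant byte (little-endian). */
static uint32_t load_le32(const unsigned char *p)
{
    return  (uint32_t)p[0]        |
           ((uint32_t)p[1] << 8)  |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}
```

With the question's array, load_be32((const unsigned char *)str) gives 1633837824 and load_le32((const unsigned char *)str) gives 6513249, on any host.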

Eric Postpischil
1
unsigned int a = *(unsigned int*)str;

This initialization is not correct and invokes undefined behavior. It violates the C aliasing rules and potentially violates the processor's alignment requirements.

ouah
1

Both are correct in a way:

  • Your first solution copies in native byte order (i.e. the byte order the CPU uses) and thus may give different results depending on the type of CPU.

  • Your second solution copies in big endian byte order (i.e. most significant byte at lowest address) no matter what the CPU uses. It will yield the same value on all types of CPUs.

What is correct depends on how the original data (array of char) is meant to be interpreted.
For example, Java class files always use big-endian byte order, no matter what the CPU uses, so if you want to read ints from a Java class file you have to use the second way. In other cases you might want the CPU-dependent way (I think Matlab writes ints to files in native byte order, cf. this question).
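To make the class-file case concrete: a Java class file starts with the 4-byte magic number 0xCAFEBABE, stored big-endian. The sketch below checks for it using the shift-based method; the function name looks_like_class_file and the CLASS_MAGIC macro are invented for this example, and the buffer is assumed to hold the first bytes of the file.

```c
#include <stddef.h>
#include <stdint.h>

#define CLASS_MAGIC 0xCAFEBABEu  /* illustrative macro name */

/* Hypothetical helper: returns 1 if buf starts with the class-file magic
   number, 0 otherwise. The bytes are assembled most significant first,
   so the result does not depend on the host CPU's byte order. */
static int looks_like_class_file(const unsigned char *buf, size_t len)
{
    uint32_t magic;

    if (len < 4) {
        return 0;  /* too short to contain the magic number */
    }
    magic = ((uint32_t)buf[0] << 24) |
            ((uint32_t)buf[1] << 16) |
            ((uint32_t)buf[2] << 8)  |
             (uint32_t)buf[3];
    return magic == CLASS_MAGIC;
}
```

Reading the same four bytes with memcpy would only match the magic number on a big-endian host, which is why a fixed byte order is the right choice for a file format.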

Curd
  • Both of the first two can cause crashes. This should be mentioned in any answer. Neither is correct. – Eric Postpischil Oct 11 '13 at 17:43
  • @Eric Postpischil: *1st way*: alignment is a completely different issue that has nothing to do with the OP's original question. In very many cases (i.e. on many hardware platforms) alignment doesn't matter at all and code like this is completely ok. *2nd way*: this will definitely not result in a crash under any circumstances (no matter if int is large enough for the value shifted by 24 bits) – Curd Feb 10 '14 at 14:24
  • Alignment does matter and does have to do with the OP’s original question: Aliasing a `char` array as an `int` is not guaranteed to conform to alignment requirements and may crash in some C implementations. The fact that it does not crash on many platforms does not make it okay because it does not erase the fact that it does crash on some. – Eric Postpischil Feb 10 '14 at 14:39
  • The second way may overflow in `str[0] << 24`. `str[0]` is a `char`, so it is promoted to `int` (except possibly in perverse C implementations where an `int` is not wider than a `char`). This is a signed integer. Then shifting it by 24 bits may overflow the range of an `int`. E.g., if `str[0]` is 128, then `str[0] << 24` would be 2147483648, but the largest value representable by a 32-bit signed `int` is 2147483647. The behavior of overflow with signed integers is not defined by the C standard. The program may crash or produce incorrect results. – Eric Postpischil Feb 10 '14 at 14:42
1

You said you want to copy byte-by-byte.

That means the line unsigned int a = *(unsigned int*)str; is not allowed. However, what you're doing is a fairly common way of reading an array as a different type (such as when you're reading a stream from disk).

It just needs some tweaking:

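const char *str = "abc";              /* 4 bytes, including the terminating '\0' */
unsigned a;                           /* assumes sizeof(unsigned) == 4 */
unsigned char *c = (unsigned char *)&a;
size_t i;
for (i = 0; i < sizeof a; i++) {
    c[i] = (unsigned char)str[i];     /* copy one byte at a time */
}
printf("%u\n", a);                    /* %u matches the unsigned argument */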

Bear in mind, the data you're reading may not share the same endianness as the machine you're reading from. This might help:

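/* Requires <stdint.h> and <string.h>. Reverses the 4 bytes at data in place. */
void 
changeEndian32(void * data)
{
    uint8_t * cp = (uint8_t *) data;
    union 
    {
        uint32_t word;
        uint8_t bytes[4];
    }temp;

    temp.bytes[0] = cp[3];
    temp.bytes[1] = cp[2];
    temp.bytes[2] = cp[1];
    temp.bytes[3] = cp[0];
    /* memcpy instead of a uint32_t* cast: data may not be aligned for uint32_t */
    memcpy(data, &temp.word, sizeof temp.word);
}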
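The same byte reversal can also be done on a value rather than through a pointer, which needs neither a union nor a cast. This is only a sketch, and the name swap32 is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value using
   shifts and masks only, so the result is the same on every host. */
static uint32_t swap32(uint32_t v)
{
    return ((v >> 24) & 0x000000FFu) |  /* byte 3 -> byte 0 */
           ((v >> 8)  & 0x0000FF00u) |  /* byte 2 -> byte 1 */
           ((v << 8)  & 0x00FF0000u) |  /* byte 1 -> byte 2 */
           ((v << 24) & 0xFF000000u);   /* byte 0 -> byte 3 */
}
```

Applying swap32 twice returns the original value, and swap32(0x61626300) gives 0x00636261, the two results from the question.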
  • For union members, results are implementation-dependent if something is stored as one type and extracted as another. – David Ranieri Oct 11 '13 at 17:54
  • @AlterMann - I didn't know that. I'm interested to learn more. Do you have a reference? My C is almost always 'implementation dependent' so I'm glad to have these things pointed out. –  Oct 11 '13 at 17:59
0

If you're using the CVI (National Instruments) compiler, you can use the function Scan to do this:

unsigned int a;

For big endian: Scan(str,"%1i[b4uzi1o3210]>%i",&a);

For little endian: Scan(str,"%1i[b4uzi1o0123]>%i",&a);

The o modifier specifies the byte order. i inside the square brackets indicates where to start in the str array.

lupy87