1

In languages like C/C++, when we do:

char c = 'A';

We allocate memory to store number 65 in binary:

stuff_to_the_left_01000001_stuff_to_the_right

Then if we do:

int i = (int) c;

As I understand it, we're saying to the compiler that it should interpret the bit pattern laid out as stuff_to_the_left_01000001__00000000_00000000_00000000_stuff_to_the_right, which may or may not turn out to be 65.

The same happens when we perform a cast during an operation:

cout << (int) c << endl;

In all of the above, I got 'A' for the character and 65 in decimal. Am I being lucky, or am I missing something fundamental?
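
For reference, here is the complete program I'm running (on a typical desktop compiler where 'A' is 65):

#include <iostream>
using namespace std;

int main() {
    char c = 'A';
    int i = (int) c;

    cout << c << endl;        // prints: A
    cout << i << endl;        // prints: 65
    cout << (int) c << endl;  // prints: 65
    return 0;
}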

James Leonard

5 Answers

2

Casts in C do not reinterpret anything. They are value conversions. (int)c means take the value of c and convert it to int, which is a no-op on essentially all systems. (The only way it could fail to be a no-op is if the range of char is larger than the range of int, for example if char and int are both 32-bit but char is unsigned.)

If you want to reinterpret the representation (bit pattern) underlying a value, that value must first exist as an object (lvalue), not just the value of an expression (typically called "rvalue" though this language is not used in the C standard). Then you can do something like:

*(new_type *)&object;

However, except in the case where new_type is a character type, this invokes undefined behavior by violating the aliasing rules. C++ has a sort of "reinterpret cast" to do this which can presumably avoid breaking aliasing rules, but as I'm not familiar with C++, I can't provide you with good details on it.
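
For illustration, a minimal sketch of that reinterpretation done the always-legal way, through a character type (the names are mine, and the code is valid as both C and C++; the output depends on byte order and the size of int):

#include <stdio.h>

int main(void) {
    int object = 65;                                   /* the value must first live in an object */
    unsigned char *bytes = (unsigned char *) &object;  /* character types may alias any object */

    for (size_t k = 0; k < sizeof object; ++k)
        printf("%02x ", (unsigned) bytes[k]);          /* e.g. "41 00 00 00" on a little-endian machine with 4-byte int */
    printf("\n");
    return 0;
}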

In your C++ example, the reason you get different results is operator overloading. (int)'A' does not change the value or how it's interpreted; rather, the expression having a different type causes a different overload of the operator<< function to be called. In C, on the other hand, (int)'A' is always a no-op, because 'A' has type int to begin with in C.
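
A minimal C++ sketch of that overload selection (assuming an ASCII-based character set where 'A' is 65):

#include <iostream>

int main() {
    char c = 'A';
    std::cout << c << '\n';        // the char overload of operator<< prints the character: A
    std::cout << (int) c << '\n';  // the int overload prints the numeric value: 65
}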

R.. GitHub STOP HELPING ICE
  • Converting char to int is not a no-op. It requires zero-extending or sign-extending the value, depending on whether char is signed or not (and neglecting esoteric implementations permitted by the C standard). – Eric Postpischil Aug 11 '12 at 02:50
  • @Eric: In the abstract machine of the C language, that is a no-op. 42 is 42. It's a value. The representation only matters when it's stored in an object. Even in the physical machine with registers and zero/sign extension, the extension operation, if needed, is likely to have occurred at a completely different point from the cast. – R.. GitHub STOP HELPING ICE Aug 11 '12 at 03:20
  • The clause “which is a no-op on essentially all systems” speaks of operations on systems, not of operations in the C abstract machine. The statement that the extension operation is likely to have occurred at a completely different point from the cast is nonsensical because the extension is in the physical machine, and the cast is in the C source. The point is that conversion from char to int does not necessarily have zero cost; sometimes it does require processor time and energy to compute. – Eric Postpischil Aug 11 '12 at 11:15
  • Said differently, what I mean by claiming it is a no-op is that there is no C code whose behavior (or even whose machine code output from the compiler) will be changed by changing `c` to `(int)c` except for `sizeof c`. – R.. GitHub STOP HELPING ICE Aug 11 '12 at 15:57
2

Am I being lucky or am I missing something fundamental?

Yes, you are missing something fundamental: the compiler does not read the char from memory as if that memory held an int. Instead, it reads a char as a char, and then sign-extends the value to fit in an int, so char -1 becomes int -1 as well. Sign-extending means padding the value on the left with copies of its sign bit: 1s for a negative value, 0s for a non-negative one. Unsigned types are always padded with zeros*.

Sign extension is usually done in a register by executing a dedicated hardware instruction, so it runs very fast.
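
A short C++ sketch of the difference (using signed char and unsigned char explicitly, since plain char may be either):

#include <iostream>

int main() {
    signed char   sc = -1;    // all bits set in an 8-bit byte: 11111111
    unsigned char uc = 0xFF;  // same bit pattern

    std::cout << (int) sc << '\n';  // sign-extended: prints -1
    std::cout << (int) uc << '\n';  // zero-extended: prints 255
}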


* As Eric Postpischil noted in a comment, char type may be signed or unsigned, depending on the C implementation.
Sergey Kalinichenko
0

When you allocate a char, there's no such thing as stuff to the left or right. It's eight bits, nothing more. So when you cast an eight-bit value to 32 bits, you still get 65:

0100.0001 to 0000.0000 0000.0000 0000.0000 0100.0001

No magic, no luck.
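
To actually see the bit patterns, a quick C++ sketch (it assumes an 8-bit char and 'A' == 65, as in this answer):

#include <bitset>
#include <iostream>

int main() {
    char c = 'A';
    std::cout << std::bitset<8>(c) << '\n';         // 01000001
    std::cout << std::bitset<32>((int) c) << '\n';  // 00000000000000000000000001000001
    std::cout << (int) c << '\n';                   // 65
}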

dda
  • A char is not necessarily eight bits, nor is ‘A’ necessarily 65. – Eric Postpischil Aug 11 '12 at 02:51
  • In most cases both facts are true. ASCII and sizeof(char)=1 is quite the default behaviour. Plus of course the OP himself assumed 'A' = 65... – dda Aug 11 '12 at 02:59
  • @R..: Yes, the C standard specifies that a char is one byte. It does not specify that one byte is eight bits. – Eric Postpischil Aug 11 '12 at 09:56
  • @dda: ASCII is not the “default.” The character set depends on the C implementation. Some C implementations use EBCDIC, and some may use specialized character sets. – Eric Postpischil Aug 11 '12 at 09:59
0

In your code "i" has its own address and "c" has its own. Value is being 'copied' from c to i. As for "(int) c", same is done again. Though compiler does that for us, as follows.

|------------ i ------------|-- c -|
  0x01   0x02   0x03   0x04   0x05
+-----------------------------------......
|  00  |  00  |  00  |  41  |  41  |......
+-----------------------------------......
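
A quick C++ check that i and c really are separate objects with their own addresses (the printed addresses will of course vary):

#include <iostream>

int main() {
    char c = 'A';
    int  i = (int) c;   // the value 65 is copied into i's own storage

    std::cout << (void *) &c << " holds " << c << '\n';
    std::cout << (void *) &i << " holds " << i << '\n';   // a different address, same value 65
}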

You would have been correct if the access had been done through a pointer instead.

e.g.

 0x01 0x02 0x03
+---------------......
| 07 | 10 | 08 |......
+---------------......
#include <stdio.h>

int main(void) {
    char c = 10;
    int *p = (int *) &c;   /* a cast is needed here: &c is a char *, not an int * */
    printf("%d\n", *p);    /* reads sizeof(int) bytes starting at c: undefined behavior */
}

Here *p would combine the bytes at addresses 0x02 and 0x03 (and beyond, depending on sizeof(int)), which is why the result is unpredictable and the read is, strictly speaking, undefined behavior.

Abhinav
-1

Well, the thing is that this behavior can change depending on the platform you're compiling for and the compiler you're using.

The ISO standard defines (int) to be a cast. In this case, your compiler will interpret (int)c like static_cast<int>(c) in C++.

Now, you're lucky: your compiler interprets (int) as a simple cast. That is the common behavior for any C/C++ compiler, but there might be some evil, no-name C++ compilers which will do a reinterpret cast on that one, ending up with an unpredictable result (depending on the platform).

That is why you should use static_cast<int>(c) to be 100% sure, and, if you really want to reinterpret it, reinterpret_cast.

But, again, it's usually a C-style cast, and therefore c will be converted into an integer.
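
For reference, a sketch of the explicit C++ spellings (note that reinterpret_cast works on references and pointers, not plain values, and the commented-out line would be undefined behavior):

#include <iostream>

int main() {
    char c = 'A';

    int v = static_cast<int>(c);   // value conversion: v is 65
    std::cout << v << '\n';

    // Reinterpretation needs a reference or pointer type, and reading a
    // whole int out of c's single byte this way would be undefined behavior:
    // int r = reinterpret_cast<int &>(c);
}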

Alzurana
  • The C standard requires converting a char to int to produce the same value. It is not undefined behavior. – Eric Postpischil Aug 11 '12 at 02:56
  • @Eric: If `sizeof(int)==1` and `char` is unsigned or `int` has padding bits, applying the `(int)` cast to a `char` does not produce the same value in all cases; the result is implementation-defined (or an implementation-defined signal) if the value does not fit in `int`. – R.. GitHub STOP HELPING ICE Aug 11 '12 at 03:22
  • @R..: Clauses 5 and 6 of the C standard permit an int to be effectively a signed char. However, this would break the behavior specified in 7.19. We can pass any int except EOF to ungetc, and it is converted to an unsigned char. This allows us to put any unsigned char value into the stream, possibly excepting the one corresponding to EOF. Then we can call fgetc, and it must convert that unsigned char to int, and the result is required to specify the same character as the int originally passed to ungetc... – Eric Postpischil Aug 11 '12 at 11:06
  • This compels the conversion from unsigned char to int to be the inverse of the conversion from int to unsigned char, modulo UCHAR_MAX+1. That does leave the possibility that `(int) (unsigned char) EOF` produces a signal or implementation-defined value. So you are correct, unless we can complete this single-character gap. – Eric Postpischil Aug 11 '12 at 11:06
  • I think this completes it: 7.19.2 requires that data read in from a binary stream compare equal to data earlier written to a stream, and we can write any character (including the one corresponding to EOF) with fputc. – Eric Postpischil Aug 11 '12 at 11:10
  • By the way, this has already been discussed in detail: http://stackoverflow.com/questions/3860943/can-sizeofint-ever-be-1-on-a-hosted-implementation – R.. GitHub STOP HELPING ICE Aug 11 '12 at 12:14