What is the relationship between memory representation and value of a variable in C?

Question

In C, it's true that:

[8-bit] signed char: -127 to 127
[8-bit] unsigned char: 0 to 255

But what really happens in memory? Is a signed char represented in two's complement and an unsigned char represented without any specific representation (that is, as a plain bit sequence such as 11111111)?

How does the executable keep track of the variable type it's reading, to figure out whether the value in the CPU register is to be interpreted as two's complement or not? Is there some metadata that associates a variable name with its type?

Thanks!

Why not read the only [authoritative resource](http://port70.net/~nsz/c/c11/n1570.html#6.2.6)? It should answer all your questions. And you might want to research _statically_ vs. _dynamically_ typed languages. – too honest for this site, Aug 29 '15 at 14:19
This should be helpful: http://stackoverflow.com/questions/7681024/negative-numbers-are-stored-as-2s-complement-in-memory-how-does-the-cpu-know-i – aniliitb10, Aug 29 '15 at 14:23
Usually the first bit is used for the sign. `0b01111111` is 127, `0b10000000` is -128. – EvgeniyZh, Aug 29 '15 at 15:42

HelloWorld · Accepted Answer · 2015-08-29T14:32:13.573

5

There is no metadata. The compiler knows the type of every variable at compile time and emits different machine instructions for some operations depending on that type; the underlying hardware simply executes them. It becomes more obvious when you compare the assembly.

```c
void test1()
{
  char p = 0;            /* plain char is signed on this target (x86-64, clang) */
  p += 3;
}

void test2()
{
  unsigned char p = 0;
  p += 3;
}
```

Below are the instructions the compiler generated from the source posted above: the output of clang 3.7 with optimization disabled (-O0). You can ignore most of the instructions if you are not familiar with them. Keep the focus on movsx and movzx. These two instructions make the difference in how the memory location is treated: movsx widens the byte by copying its sign bit into the upper bits, while movzx fills the upper bits with zeros.

```asm
test1():                              # Instructions for test1
    push    rbp
    mov     rbp, rsp
    mov     byte ptr [rbp - 1], 0
    movsx   eax, byte ptr [rbp - 1]   # <-- Load byte into 32-bit eax with sign-extension
    add     eax, 3
    mov     cl, al
    mov     byte ptr [rbp - 1], cl
    pop     rbp
    ret

test2():                              # Instructions for test2
    push    rbp
    mov     rbp, rsp
    mov     byte ptr [rbp - 1], 0
    movzx   eax, byte ptr [rbp - 1]   # <-- Load byte into 32-bit eax with zero-extension
    add     eax, 3
    mov     cl, al
    mov     byte ptr [rbp - 1], cl
    pop     rbp
    ret
```
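The same distinction can be reproduced in C itself. The following is a minimal sketch, not part of the original answer: the helper names `widen_as_signed` and `widen_as_unsigned` are made up for illustration, and the negative result assumes a two's complement machine with 8-bit bytes.

```c
/* Hypothetical helpers (names are assumptions, not from the answer).
 * Both read the very same byte in memory; only the static type used
 * for the read differs, so the compiler picks a different widening. */

/* Read the byte as signed char: widened to int by sign-extension
 * (typically movsx on x86). */
int widen_as_signed(const void *byte)
{
    return *(const signed char *)byte;
}

/* Read the byte as unsigned char: widened to int by zero-extension
 * (typically movzx on x86). */
int widen_as_unsigned(const void *byte)
{
    return *(const unsigned char *)byte;
}
```

Passing the address of a byte holding the bit pattern 11111111 to both helpers should give -1 from the first and 255 from the second, although the stored byte is identical.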

edited Aug 29 '15 at 14:32

answered Aug 29 '15 at 14:23


x86 assembler is not really a good choice for a demonstration to people who are likely not familiar with assembler in general. You should explain the individual lines. – too honest for this site Aug 29 '15 at 14:26
The actual operations movzx and movzx are used.. it's "movsx", I know it is a typo! – aniliitb10 Aug 29 '15 at 14:29

I've compiled it with `signed char x = 255; unsigned char y = 255;`. I've noticed that the compiler stores -1 for it: `movb $-1, -2(%rbp)`. If I compile it with `signed char x = 5` it emits $5. Why is that? Is it because 255 = -1 = 11111111 in two's complement? Thanks! – Laurent Aug 29 '15 at 14:30
@ldc: It depends on the compiler. `icc` prefers `$255` for `unsigned char x = 255`. It's up to the compiler. The internal representation is (and must be) the same, though. And yes, it's because of two's complement. – HelloWorld Aug 29 '15 at 14:46

dtech · Answer 2 · 2015-08-29T15:34:28.933

C is a statically typed language. The interpretation of memory is entirely defined by the context. That is, the type is known at compile time (sufficiently well, in the case of dynamic dispatch) and the compiler makes all the decisions in advance. For the sake of performance, runtime checks are reduced to the bare minimum (in C to none, unless you implement dynamic dispatch or RTTI manually).
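One way to see that the type lives in the compiler rather than in the running program is C11's `_Generic`, which selects an expression purely from the static type of its operand. This is only an illustrative sketch; the macro `TYPE_NAME` and the two wrapper functions are hypothetical names, and a C11 compiler is assumed.

```c
/* Hypothetical macro: the compiler picks one string based on the static
 * type of x. Nothing is inspected at run time. */
#define TYPE_NAME(x) _Generic((x),          \
    signed char:   "signed char",           \
    unsigned char: "unsigned char",         \
    char:          "char",                  \
    default:       "other")

/* The parameter type alone decides which string each function returns. */
const char *describe_signed(signed char v)
{
    return TYPE_NAME(v);
}

const char *describe_unsigned(unsigned char v)
{
    return TYPE_NAME(v);
}
```

Both functions can be called with the same value; the returned name depends only on the declared parameter type, which the compiler resolved before the program ever ran.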

In C (and C++) you can easily interpret the same memory location in different ways: all you have to do is acquire a pointer to it and cast it to a different type. This is very unsafe if you don't know what you are doing.
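As a hedged sketch of such reinterpretation, the helpers below (hypothetical names `byte_at` and `float_bits`) look at an object's storage in two ways. Viewing any object through an `unsigned char *` is permitted by the language. Dereferencing a pointer cast to an unrelated type such as `uint32_t *` would run into the aliasing rules, so the second helper swaps in `memcpy` instead of a cast. It assumes a 32-bit `float`, and the bit pattern mentioned afterwards assumes IEEE 754.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the i-th byte of any object's storage.
 * Access through a character pointer is always allowed. */
unsigned char byte_at(const void *obj, size_t i)
{
    return ((const unsigned char *)obj)[i];
}

/* Hypothetical helper: reinterpret the bits of a float as an integer.
 * Assumes sizeof(float) == sizeof(uint32_t). memcpy is used instead of
 * a pointer cast to stay clear of aliasing problems. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a typical IEEE 754 platform, `float_bits(1.0f)` yields `0x3F800000`: the same four bytes mean 1.0 when read as a float and 1065353216 when read as an integer.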

Fawzan · Answer 3 · 2015-08-29T14:37:51.457

0

The internal representation of numbers is not part of the C language; it is a property of the architecture of the machine itself. Most implementations use two's complement because it makes addition and subtraction the same binary operation for signed and unsigned values (the hardware does not need separate signed and unsigned versions of them).

FYI Almost all existing CPU hardware uses two's complement, so it makes sense that most programming languages do, too.
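The claim that addition is the same operation for both interpretations can be checked with a small sketch. The function names `add_bits` and `as_twos_complement` are made up for this illustration, and the code assumes `uint8_t` is available.

```c
#include <stdint.h>

/* Hypothetical helper: add two 8-bit patterns with plain unsigned
 * arithmetic, wrapping modulo 256. */
uint8_t add_bits(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Hypothetical helper: interpret an 8-bit pattern as a two's complement
 * value, computed arithmetically so it does not depend on how the
 * implementation converts out-of-range values. */
int as_twos_complement(uint8_t bits)
{
    return bits < 128 ? (int)bits : (int)bits - 256;
}
```

For example, `0xFD` is 253 unsigned and -3 in two's complement. `add_bits(0xFD, 0x05)` gives `0x02`, which is correct both as 253 + 5 modulo 256 and as -3 + 5. The adder never needs to know which interpretation the program intended.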

edited Aug 29 '15 at 14:37

answered Aug 29 '15 at 14:17


And no, the language does not support it, but leaves it to the implementation. Note that the conversion rules for signed to unsigned are actually optimized for 2's complement. – too honest for this site Aug 29 '15 at 14:23
In C as originally conceived, the value of a variable was effectively defined by the underlying storage. Changing the value would modify the storage, and changing the storage would modify the value. The C89 Standard weakened that connection, and C99 weakened it further, to the point that C99 is semantically much weaker than the language which was popular in the 1990s. – supercat Feb 14 '16 at 17:43
