13

If all values are nothing more than one or more bytes, and no byte can contain metadata, how does the system keep track of what sort of number a byte represents? Looking into two's complement and single-precision floating point on Wikipedia reveals how these numbers can be represented in base two, but I'm still left wondering how the compiler or processor (not sure which I'm really dealing with here) determines that a given byte must be a signed integer.

It is analogous to receiving an encrypted letter and, looking at my shelf of ciphers, wondering which one to grab. Some indicator is necessary.

If I think about what I might do to solve this problem, two solutions come to mind. Either I would claim an additional byte and use it to store a description, or I would allocate sections of memory specifically for numerical representations: a section for signed numbers, a section for floats, and so on.

I'm dealing primarily with C on a Unix system but this may be a more general question.

Jack Stout
  • see: http://en.wikipedia.org/wiki/Symbol_table – Sam Dufel Mar 01 '13 at 17:24
  • Every variable in C must have a type as part of the variable declaration (possibly definition, never sure which). The compiler reads the type and remembers it. There is no mystery about the type of a variable. – DwB Mar 01 '13 at 17:25
  • The compiler will use the correct instruction (since unsigned and signed can be derived with rules stated in the standard). There are different instructions for signed and unsigned computation, at least on the architectures that I have seen. For C, no additional information about type is stored at run time. – nhahtdh Mar 01 '13 at 17:26
  • Actually only the compiler knows the type. It compiles your code into a program which contains the right instructions to manipulate that type. The type is not reflected at all in the resulting machine code. – Hannesh Mar 01 '13 at 17:26
  • @Hannesh your comment should be posted as an answer! – dsh Mar 01 '13 at 17:33
  • Yeah, it's more on the point and accurate than anything else posted. – Sam Dufel Mar 01 '13 at 17:34
  • I would add in your link to symbol tables to his answer to explain how the compiler outputs the correct instructions, but otherwise agreed. – ajp15243 Mar 01 '13 at 17:38
  • @Hannesh That makes sense. I had suspected that this problem was specifically handled within the compiler. – Jack Stout Mar 01 '13 at 17:49

5 Answers

9

how does the system keep track of what sort of number a byte represents?

"The system" doesn't. During translation, the compiler knows the types of the objects it's dealing with, and generates the appropriate machine instructions for dealing with those values.

John Bode
  • Then the compiler must be maintaining some metadata which is used while writing the assembly code, then thrown away. That answers my question. Thank you. – Jack Stout Mar 01 '13 at 18:00
  • @JackStout: Pretty much, yes. Most compilers maintain what's called a *symbol table*, which contains information about an object's type, visibility, lifetime, linkage, etc. During the translation phase it's used to enforce semantic rules (such as matching types in assignments, or preventing you from modifying a `const`-qualified object). During the code generation phase it's used to pick the right machine instructions for the operation (e.g., this longword is being used for a floating-point calculation). – John Bode Mar 01 '13 at 18:09
1

Ooh, good question. Let's start with the CPU - assuming an Intel x86 chip.

It turns out the CPU does not know whether a byte is "signed" or "unsigned." Instead, when you add two numbers - or perform just about any operation - flags in a "status register" are set.

Take a look at the "sign flag." When you add two numbers, the CPU does just that - adds the numbers and stores the result in a register. But the CPU also asks: "if we instead interpreted these numbers as two's complement signed integers, would the result be negative?" If so, the "sign flag" is set to 1.

So if your program cares about signed versus unsigned and you are writing in assembly, you would check the status of that flag, and the rest of your program would perform a different task based on it.

So when you use signed int versus unsigned int in C, you are basically telling the compiler how (or whether) to use that sign flag.
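
As an illustrative sketch (assuming an x86 target; the exact instructions vary by compiler and optimization level), the same < comparison compiles differently depending on signedness, because different status flags are consulted:

/* Signed comparison: typically compiled to jl/setl, which
   consults the sign and overflow flags. */
int signed_less(int a, int b) {
    return a < b;
}

/* Unsigned comparison: typically compiled to jb/setb, which
   consults the carry flag instead. */
int unsigned_less(unsigned int a, unsigned int b) {
    return a < b;
}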

poundifdef
  • If I remember correctly, the advantage of using two's complement is that you just perform normal addition on the numbers. The CPU doesn't need to care about whether the number is signed / unsigned / negative, it just adds the bits. It's the higher level code that interprets a value as positive or negative. – Sam Dufel Mar 01 '13 at 17:34
  • @SamDufel this is exactly right. the flags provide a convenience in this case - rather than writing a routine to check the high-order bit on a number, you can `jump` depending on that register. (Well, there are other tricks that one uses that bit for, but you are indeed correct that, for this purpose, the flag is not strictly necessary.) – poundifdef Mar 01 '13 at 20:37
1

It is important to remember that C and C++ are high-level languages. The compiler's job is to take the plain-text representation of the code and build it into the platform-specific instructions the target platform expects to execute. For most people using PCs, this tends to be x86 assembly.

This is why C and C++ are so loose in how they define the basic data types. For example, most people say there are 8 bits in a byte. The standard does not fix that number; it only requires that a byte be at least 8 bits wide (CHAR_BIT >= 8) and recognizes the byte as the smallest addressable unit of data, so nothing stops a machine from having, say, 9 bits per byte as its native interpretation of data.
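
You can ask the implementation directly how wide its bytes are; CHAR_BIT from <limits.h> is guaranteed to be at least 8:

#include <limits.h>
#include <stdio.h>

int main(void) {
    /* CHAR_BIT is the number of bits in a byte on this implementation;
       the standard guarantees CHAR_BIT >= 8. */
    printf("bits per byte: %d\n", CHAR_BIT);
    return 0;
}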

So the interpretation of data is up to the instruction set of the processor. In many modern languages there is another abstraction on top of this, the Virtual Machine.

If you write your own scripting language it is up to you to define how you interpret your data in software.
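
The question's first idea, claiming extra storage for a description of the value, is essentially what dynamically typed interpreters do. Here is a minimal sketch of such a tagged value in C (the names are purely illustrative, not any particular interpreter's API):

#include <stdio.h>

/* The "additional byte used to store a description" from the question:
   a tag recording the runtime type, stored next to the value itself. */
enum tag { TAG_INT, TAG_DOUBLE };

struct value {
    enum tag tag;   /* runtime type information */
    union {
        int    i;
        double d;
    } as;           /* the raw bits, interpreted according to the tag */
};

static void print_value(struct value v) {
    switch (v.tag) {
    case TAG_INT:    printf("int: %d\n", v.as.i);    break;
    case TAG_DOUBLE: printf("double: %f\n", v.as.d); break;
    }
}

int main(void) {
    struct value a = { TAG_INT,    { .i = -23 } };
    struct value b = { TAG_DOUBLE, { .d = 0.1 } };
    print_value(a);
    print_value(b);
    return 0;
}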

Matthew Sanders
  • C is a high level language? really? It's a joke, C is just 3mm higher than a macro assembler! ADA is a high level language. – Aubin Mar 01 '13 at 17:38
  • Yes, technically C is a high level language. Much like GLSL is a high level shading language that builds into assembly for the GPU. Assembly is the target language that C is built into, and technically machine language, or just a simple binary set of data, is the lowest. Many people don't think of C or C++ as high level any longer, as we tend to think of scripting languages as such. There was a time people would code in binary. – Matthew Sanders Mar 01 '13 at 17:40
  • @Aubin: C is every bit as "high-level" as Ada; it just doesn't provide as many *abstractions* as Ada. – John Bode Mar 01 '13 at 17:44
  • See http://en.wikipedia.org/wiki/High-level_programming_language, "Relative Meaning" chapter. – Aubin Mar 01 '13 at 17:48
1

The code that is executed carries no information about types. The only tool that knows the types is the compiler, at the time it compiles the code. Types in C are solely a compile-time restriction to prevent you from using the wrong type somewhere. While compiling, the C compiler keeps track of the type of each variable and therefore knows which type belongs to which variable.

This is the reason why you need to use format strings in printf, for example: printf has no way of knowing what types it will get in the parameter list, as this information is lost. In languages like Go or Java you have a runtime with reflection capabilities, which makes it possible to recover the type.
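
A small sketch of that point: printf trusts the format string precisely because the argument types are gone by run time.

#include <stdio.h>

int main(void) {
    double d = 0.1;

    /* Correct: the format specifier matches the argument's type. */
    printf("%f\n", d);

    /* A mismatched specifier such as printf("%d\n", d) is undefined
       behavior: printf cannot detect the lie, because no type
       information survives compilation. */
    return 0;
}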

Suppose your compiled C code still carried type information: the resulting assembly language would then need a way to check types. It turns out that the only thing close to a type in assembly is the size of an instruction's operands, determined by suffixes (in GAS). So all that is left of your type information is the size, and nothing more.

One example of an assembly-like language that does support types is Java VM bytecode, whose instructions encode the primitive type of their operands (iadd versus dadd, for example).

nemo
0

In C, apart from the compiler (which knows the types of the given values perfectly well), there is no part of the system that knows the type of a given value.

Note that C by itself doesn't come with any runtime type information system.

Take a look at the following example:

int i_var;     /* file-scope variable of type int    */
double d_var;  /* file-scope variable of type double */

int main () {

  i_var = -23;
  d_var = 0.1;

  return 0;
}

In the code there are two different types of values involved: one to be stored as an integer and one to be stored as a double.

The compiler that analyzes the code knows the exact types of both of them. Here is a dump of a short fragment of the type information gcc held while generating code, obtained by passing -fdump-tree-all to gcc:

@1      type_decl        name: @2       type: @3       srcp: <built-in>:0      
                         chan: @4      
@2      identifier_node  strg: int      lngt: 3       
@3      integer_type     name: @1       size: @5       algn: 32      
                         prec: 32       sign: signed   min : @6      
                         max : @7      
...
@5      integer_cst      type: @11      low : 32      
@6      integer_cst      type: @3       high: -1       low : -2147483648 
@7      integer_cst      type: @3       low : 2147483647 
...

@3805   var_decl         name: @3810    type: @3       srcp: main.c:3      
                         chan: @3811    size: @5       algn: 32      
                         used: 1       
...
@3810   identifier_node  strg: i_var    lngt: 5    

Hunting down the @-links you can clearly see that a lot of information is stored about the memory size, alignment constraints, and allowed minimum and maximum values of the type "int" in nodes @1-@3 and @5-@7. (I left out node @4, as its "chan" entry is merely used to chain up the type definitions in the generated tree.)

Regarding the variable declared at main.c line 3, it is known that it holds a value of type int, as seen from the type reference to node @3.

You will surely be able to hunt down the entries for double and for d_var in an experiment of your own, if you don't trust me that they are there as well.

Taking a look at the assembler code the compiler generates (obtained by passing gcc the -S switch), we can see how it used this information during code generation:

    .file   "main.c"
    .comm   i_var,4,4
    .comm   d_var,8,8
    .text
.globl main
    .type   main, @function
main:
    pushl   %ebp
    movl    %esp, %ebp
    movl    $-23, i_var
    fldl    .LC0
    fstpl   d_var
    movl    $0, %eax
    popl    %ebp
    ret
    .size   main, .-main
    .section    .rodata
    .align 8
.LC0:
    .long   -1717986918
    .long   1069128089
    .ident  "GCC: (Debian 4.4.5-8) 4.4.5"
    .section    .note.GNU-stack,"",@progbits

Looking at the assignment instructions, you will see that the compiler figured out the right instructions: movl to assign our int value, and fldl/fstpl to load and store our double value.

Nevertheless, beyond the instructions chosen at the machine level, there is no indication of the type of those values. Taking a look at the value stored at .LC0, the double value 0.1 was even broken down into two consecutive storage locations, each holding a long, to fit the only "types" the assembler knows.

As a matter of fact, breaking the value up this way was just one choice among other possibilities; using 8 consecutive values of "type" .byte would have done equally well.
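
If you want to check those two .long values yourself, here is a small sketch (assuming a little-endian machine with a 64-bit double and 32-bit words, as in the listing above):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    double d = 0.1;
    int32_t words[2];

    /* Copy the object representation of the double into two 32-bit
       words. On little-endian x86 the low word comes first, matching
       the order of the two .long directives at .LC0. */
    memcpy(words, &d, sizeof d);
    printf(".long %d\n.long %d\n", (int)words[0], (int)words[1]);
    /* Expected output: .long -1717986918 and .long 1069128089,
       the same values gcc emitted. */
    return 0;
}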

mikyra