3

What are the main steps behind compiling a C program? By compiling, I mean (maybe wrongly) getting a binary from a plain text containing C code, using gcc.

I would love to understand some key points of the process:

  1. By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?

  2. Is gcc converting any C to assembly language?

  3. I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?

  4. Why exactly can't I see the 0s and 1s if I open the binary file with a text editor?

Pabluez
  • **as** - assembler, **ld** - linker, GCC comes with those – bhathiya-perera Nov 20 '14 at 03:29
  • Please see this: http://stackoverflow.com/questions/6264249/how-does-the-compilation-linking-process-work – happyvirus Nov 20 '14 at 03:29
  • gcc doesn't convert C directly into assembly. This will give you a better idea: http://en.wikipedia.org/wiki/GNU_Compiler_Collection#GENERIC_and_GIMPLE – sunny1304 Nov 20 '14 at 03:40
  • @Sajidkhan, I saw this question before asking. If you know the answers to the 4 points in my question, I would appreciate an answer. They are not clearly covered in that other post. – Pabluez Nov 20 '14 at 03:45
  • Compilers in general are not required to waste time by emitting textual assembly code and then parsing it again. They already have the internal representation, so they can just use that to create object files. – harold Nov 20 '14 at 07:56
  • 1
    What text editor are you using? If you're using VIM then you can do `vim -b a.out` and then `:%!xxd`. This will show the hexadecimal values of your binary file. You can also see the hexadecimal values of your binary file with `objdump -s a.out`. – Z boson Nov 24 '14 at 10:23
  • 1
    I'm voting to close this question as off-topic because conceptual questions about compilation belong on Computer Science SE – ali_m Mar 19 '15 at 10:53

4 Answers

11

Lots happens :)

Here are some of the key steps (BTW, this is how I think of compilation; the following steps bear only a passing resemblance to the translation phases defined in the standard).

  1. The preprocessor runs on the source file.

    The pre-processor does all sorts of things for us, including:

    • It performs trigraph replacement (trigraphs are special three-character sequences that stand in for symbols that some early keyboards didn't have).
    • It performs macro replacement (i.e. #define) by simple textual replacement.
    • It grabs any header files and copies their entire contents to where the #include line was.

    Under Linux, the program that does this is cpp, the C preprocessor (in current GCC versions the preprocessing is built into cc1), and using gcc you can stop after this step by using the -E flag.

  2. After the pre-processor runs, we have a file that contains all the information that is necessary for the parser to run, check our syntax, and emit assembly. Under Linux, the program that most likely does this is cc1, and using gcc you can stop after this step by using the -S flag (capital S).

  3. The assembly is converted into object code by, most likely, the GNU Assembler (gas, run as the program as), and using gcc you can stop at this step by using the -c flag.

  4. Finally, one or more object files, along with libraries, are converted into an executable by the linker. The linker under Linux is normally ld, and using gcc without any special flags runs all the way through this step.

nrz
thurizas
  • Thanks for your answer. I will try to use the -C flag and see the assembly code that gcc is generating. Would you mind updating your answer to cover the 4 questions I've listed in my original question? If you know the answers, of course. Thanks in advance – Pabluez Nov 20 '14 at 03:51
  • 1
    You will not see any assembly if you pass the `-c` option; that will compile to an object file. You will want to pass the `-S` option, which will compile to assembly (default AT&T format). To output `intel` format assembly, pass the `-masm=intel` option. So if you want assembly in intel format: `gcc -S -masm=intel -o outfile.asm infile.c` – David C. Rankin Nov 20 '14 at 04:29
  • 1
    Nice explanation of the traditional compiler. There is also often an intermediate code between the -E and -S (gimple?, llvm ir, etc...) that may only be useful if running with a JIT compiler. – technosaurus Nov 20 '14 at 05:52
  • Nice answer. I added some code formatting and bolded the names of the phases to make it easier to follow. – nrz Nov 20 '14 at 06:43
  • What exactly is object code? Is it the code converted from assembly to 0s and 1s in a way the processor will understand? What is the magic the linker does to make something executable? – Pabluez Nov 21 '14 at 15:12
  • @Pabluez Object code is machine language (i.e. 0s and 1s). I use the term for the result of converting one source file all the way to machine code. In large programs there may be a large number of source files, and each of these goes through the chain described above to produce an object file. It is not until the linker runs that all the object files get combined into an executable or a library. – thurizas Nov 21 '14 at 15:23
  • 1
    @Pabluez You might want to ask a question about what a linker does, I doubt I can squeeze a good answer into a comment. However, you can check out [Linkers and Loaders](http://linker.iecc.com/) by John Levine, it is a good place to start learning about them. – thurizas Nov 21 '14 at 15:25
6

Since you specifically mentioned 'By the end of the day I need to transform my C code to a language that specifically my CPU should understand,' I'll explain a little about how compilers work.

Typical compilers do a few things.

First, they do something called lexing. This step takes individual characters and combines them into 'tokens' which are things the next step understands. This step differentiates between language keywords (like 'for' and 'if' in C), operators (like '+'), constants (like integers and string literals), and other stuff. What exactly it differentiates depends on the language itself.

The next step is the parser, which takes the stream of tokens produced by the lexer and (commonly) converts it into something called an "Abstract Syntax Tree," or AST. The AST represents the computations done by the program with data structures that the compiler can navigate. Commonly the AST is language-independent, and compilers like GCC can parse different languages into a common AST format that the next step (the code generator) can understand.

Finally, the code-generator goes through the AST and outputs code that represents the semantics of the AST, that is, code that actually performs the computations that the AST represents.

In the case of GCC, and probably other compilers, the compiler does not actually produce machine code. Instead, it outputs assembly code that it passes to an assembler. The assembler goes through a similar process of lexing, parsing, and code-generating to actually produce machine-code. After all, an assembler is just a compiler that compiles assembly code.

In the case of C (and many others), the assembler is commonly not the final step. The assembler produces things called object files, which contain unresolved references to functions in other object files or libraries (like printf in the C standard library or functions from other C files in your project). These object files are passed to something called a 'linker', whose job it is to combine all of the object files into a single binary and resolve all of the unresolved references in the object files.

Finally, after all of these steps, you have a complete executable binary.

Note that this is the way that GCC and many, many other compilers work, but it's not necessarily the case. Any program that you could write that accurately accepts a stream of C code and outputs a stream of some other code (assembly, machine code, even javascript) that is equivalent, is a compiler.

Also, the steps are not always completely separate. Rather than lexing an entire file, then parsing the entire result, then generating code for the entire AST, a compiler may do a bit of lexing, then start parsing when it has some tokens, then go back to lexing when the parser needs more tokens. When the parser feels it knows enough, it might do some code generation before having the lexer produce some more tokens for it.

  • 1
    Good discussion. The only thing missing is **What the linker is creating.** A short discussion of what the `ELF` format (and competing formats) are would be beneficial [**Executable and Linkable Format (ELF)**](http://www.skyfree.org/linux/references/ELF_Format.pdf). That would make it fairly complete. – David C. Rankin Nov 20 '14 at 04:23
  • Thanks very much. I considered that, but it seemed sort of beyond the scope. There are many executable binary formats, and how exactly they're formatted didn't really seem relevant. – jack_rabbit Nov 20 '14 at 04:25
  • Great explanation. I still have the 4 points within the question to be answered. Another question comes from your answer too: would it be possible for an experienced assembly programmer to write a "bash shell script" that converts Bash itself to assembly, which I could then make executable by linking it with ld? – Pabluez Nov 20 '14 at 04:25
  • 1
    @Pabluez absolutely. It would be ugly and probably horrible, but you could potentially write a bash to assembly compiler in bash. – jack_rabbit Nov 20 '14 at 04:26
  • 1
    @Pabluez As for why you can't see the 0's and 1's, your editor interprets bytes as characters. You probably see some crazy glyphs if you try to open a binary file. If you want to get close (hex is as good as binary) try `$ hexedit my_executable` – jack_rabbit Nov 20 '14 at 04:28
  • @Pabluez And for #1, only the CPU cares about a binary having the proper instructions. – jack_rabbit Nov 20 '14 at 04:29
  • @jack_rabbit And why aren't there actually literal 0s and 1s within the file? If the whole process is standard and anyone smart enough could develop an assembler and a linker, why isn't it possible to see the 0s and 1s generated? Why a hex editor if it's a binary format? This is a part that I can't quite understand. – Pabluez Nov 20 '14 at 04:32
  • 1
    @Pabluez If you open a file, and write a bunch of 1's and 0's, you aren't literally writing that sequence of 1 bits and 0 bits to a file, you are writing bytes (probably 8 bits) for each '1' and '0' encoded in a character encoding (probably ASCII) – jack_rabbit Nov 20 '14 at 04:35
  • 1
    You could feasibly write a text editor that would read and edit binary, in fact, I'm sure that there are many out there, but they'd be relatively useless, since long strings of binary are insanely hard for humans to read. – jack_rabbit Nov 20 '14 at 04:36
  • @jack_rabbit - most people use hex editors (because as you said binary isn't exactly human readable) and people use them all of the time. I have used them to make older versions of skype and flash player continue to work as if they had been upgraded. Minimum Profit text editor has one built-in and has gtk, qt and ncurses backends. – technosaurus Nov 20 '14 at 05:47
2

By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?

You are not very clear here. If you are asking which tools have knowledge of your CPU-specific instructions, they are the assembler, disassembler, debugger, and maybe some others. They can generate machine code or convert it back to disassembly.

If you are asking who cares about which instructions are used, it's the processor that needs to execute them, as each instruction set represents even such a common instruction as "add two integers" in a completely different manner.

Is gcc converting any C to assembly language?

Yes, C (or a program in any other supported language) is converted to assembly by GCC. There are many steps involved, and at least two additional internal representations are used in the process. Details are explained in the GCC internals document. Finally, the compiler "backend" generates the assembly representation from simple "patterns" generated by previous compiler passes. You can ask GCC to output this assembly by using the -S flag. If you don't specifically ask for it, the next step (assembling) is automatically executed and you only see your final executable file.

I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?

First, take note that assembly languages differ for each CPU, as they are supposed to represent the CPU's machine language 1:1. The assembler then translates assembly code into machine code. Who ships it? Anyone who builds it. With the GNU toolchain it's part of the binutils package, and it's usually installed by default on most Linux distributions. This is not the only assembler available. Also note that although the GNU "suite" (GCC/binutils/gdb) supports many architectures, you need to use the appropriate port for your architecture. Your desktop PC's default assembler, for example, cannot compile/assemble into ARM machine code.

Why exactly can't I see the 0s and 1s if I open the binary file with a text editor?

Because a text editor is supposed to show the text representation of those 0s and 1s. Assuming each character in the file takes 8 bits, the editor interprets each subsequent group of 8 bits as a single character instead of showing separate bits. If you know that in ASCII the letter 'A' is represented by the value 65, you can also convert this back to binary: 01000001. It's a bit easier to convert a hexadecimal representation back to binary. For this you can use hexdump (or a similar tool).

dbrank0
  • Great answer. By an assembly language for each CPU, do you mean for each architecture? Because I can download the same binary of a program and it will work on any processor of the architecture that the code was compiled for, right? – Pabluez Nov 21 '14 at 15:01
  • Another thing: someone else said that the assembly conversion was optional, and that GCC has tools to convert the C source straight to the object file to be used by the linker (ld). What does that mean? Is that true? – Pabluez Nov 21 '14 at 15:04
  • 1
    More or less... There are many CPUs in the x86 architecture, but new instructions are added in each CPU generation. So not all CPUs within an architecture are compatible. As far as I know, GCC backends always create assembly code internally and the "compiler driver" then calls the assembler to assemble it and create an object file. As you will always have binutils installed if you want to use GCC, this is not an issue. Other compilers may generate machine code directly. – dbrank0 Nov 21 '14 at 17:09
1

By the end of the day I need to transform my C code to a language that specifically my CPU should understand. So, who cares about knowing my CPU-specific instructions? The operating system?

The CPU.

But note that on a modern computer the apparently single CPU is just an illusion.

It's a good enough conceptual model for simple C programming, though.


Is gcc converting any C to assembly language?

If you ask it to. Option -S will generate an assembly listing. For the PC you can choose between AT&T syntax, which is ugly as sin, peppered with percent signs, and the ordinary Intel syntax. Unfortunately AT&T (selectable via -masm=att IIRC) is the default, but you can use -masm=intel to get ordinary assembly.

If you don't ask it to produce assembly, gcc still generates it behind the scenes: the compiler proper writes assembly to a temporary file (or to a pipe, with -pipe), and the gcc driver then runs the assembler, as, to turn it into object code.

Producing assembly language as an intermediate form does add a step, and some other compilers avoid it by emitting object code directly.


I know (actually guess) that for each processor type I will need an assembler that will interpret (?) the assembly code and translate it to my CPU-specific instructions. Where is this assembler (who ships it)? Does it come with the OS?

You don't need to run such an assembler yourself. The GNU toolchain does include one, as, which is part of the binutils package that gcc is installed alongside. Unix-like OSes typically have gcc and as bundled, while Windows does not have developer tools bundled. Microsoft's dev tools are however free for downloading, now (in the last week or so) including the full Visual Studio IDE. Microsoft's assembler is ml.exe, and is known as MASM, the Macro Assembler (as if there were no other macro assemblers).


Why exactly can't I see the 0s and 1s if I open the binary file with a text editor?

That depends on the text editor, although I don't know of any that can present 0s and 1s; text editors are designed to interpret bytes as text.

You can just write such a text editor if you want it.

Fair warning though: it has no practical use that I can think of.


Finally regarding the question in the title,

What are the main steps behind compiling?

In practice there are two main steps: compilation and linking. The compilation step is further subdivided into preprocessing and core language compilation, i.e.,

    compilation → linking

… is really

    (preprocessing → core language compilation) → linking

During preprocessing, source code files are combined via #include directives. This produces a full translation unit of source code. The core language compilation translates that to an object code file, which contains machine code with some unresolved references.

Then finally the linking step combines object code files (including object code file contents in libraries) to create a single complete executable.

Cheers and hth. - Alf