201

This has been bugging me for a while. How do GCC and g++ compile themselves?

I'm guessing that every revision gets compiled with a previously built revision. Is this true? And if it is, does it mean that the oldest g++ and GCC versions were written in assembly?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
user1010005
  • 2,617
  • 5
  • 22
  • 20

2 Answers2

188

The oldest version of GCC was compiled using another C compiler, since there were others when it was written. The very first C compiler ever (ca. 1973, IIRC) was implemented either in PDP-11 assembly, or in the B programming language which preceded it, but in any case the B compiler was written in assembly. Similarly, the first ever C++ compiler (CPre/Cfront, 1979-1983) were probably first implemented in C, then rewritten in C++.

When you compile GCC or any other self-hosting compiler, the full order of building is:

  1. Build new version of GCC with existing C compiler
  2. re-build new version of GCC with the one you just built
  3. (optional) repeat step 2 for verification purposes.

This process is called bootstrapping. It tests the compiler's capability of compiling itself and makes sure that the resulting compiler is built with all the optimizations that it itself implements.

EDIT: Drew Dormann, in the comments, points to Bjarne Stroustrup's account of the earliest implementation of C++. It was implemented in C++ but translated by what Stroustrup calls a "preprocessor" from C++ to C; not a full compiler by his definition, but still C++ was bootstrapped in C.

Deduplicator
  • 44,692
  • 7
  • 66
  • 118
Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • 2
    Isn't step 2 typically repeated a couple of times? I believe `clang` does it. – bitmask Feb 24 '12 at 11:04
  • I don't see why, except for testing. Assuming the initial compiler implements the language correctly, it builds a correct though perhaps suboptimal compiler, while step 2 would implement all the optimizations that the new compiler has. The final binary should not change after step 2. – Fred Foo Feb 24 '12 at 11:08
  • 21
    The 3-step version of the bootstrap build process is indeed for verification: the compiler itself is used as its own test case. GCC compiled with [other] should produce the same results (identical binaries, discounting macros like `__DATE__` and `__TIME__` which vary even between invocations of the *same* compiler) as GCC compiled with [GCC compiled with [other]] - if not, that's a bug, and the 3-stage bootstrap build is designed to catch that. – pmdj Feb 24 '12 at 11:12
  • 21
    @pmjordan: "if not, that's a bug" or, less likely, a devious backdoor in the process of being introduced ("Reflections on Trusting Trust"). – Steve Jessop Feb 24 '12 at 11:20
  • @pmjordan: Actually, the binaries could also differ because code generation was improved (e.g. better optimization). So in general the binaries need not be identical. – sleske Feb 24 '12 at 11:35
  • @sleske: They should converge, however. Unless you reach *The Singularity* :) – bitmask Feb 24 '12 at 11:37
  • 13
    @sleske: that's not true. The binary output of step 2 must be identical to the binary output of step 3, otherwise there's a bug somewhere. The reason is as pmjordan says: NewCompiler1 and NewCompiler2 are programs with identical source (that of NewCompiler). They are given identical input (the source for NewCompiler). Therefore they will produce identical output no matter what compiler they themselves were compiled with (in this case, NewCompiler1 was compiled with OldCompiler, and NewCompiler2 was compiled with NewCompiler1). That is, NewCompiler2 and NewCompiler3 are binary identical. – Steve Jessop Feb 24 '12 at 11:52
  • 1
    @SteveJessop: Sorry, my mistake. Of course the output of step 1 and step 2 above can be different, but 2 and 3 must generate the same GCC binary. – sleske Feb 24 '12 at 11:58
  • 14
    I you ever wondered: What if we lost all C compiler binaries? And had to bootstrap from scratch? This is how I'd go about it: There's the Tiny C Compiler (which actually can compile the Linux kernel, so it's quite feature complete). All it's C source files make a mere 30k lines of code, including comments. Though even it was a quite some effort, somebody who understands C could learn from the sources, how to generate binary output and "compile" the TCC sources from hand (I actually thinking of punch cards here). Then recompile TCC with that and use it to bootstrap GCC or similar. – datenwolf Feb 24 '12 at 12:03
  • 12
    @datenwolf: something like that, yes. If we can assume that we've lost all C compiler binaries, but we still have an assembler, then we might write an assembler program TinyTinyC. It would be a less feature-complete C compiler than TinyC: we don't need it to be able to compile GCC or the linux kernel, we only need it to be able to compile TinyC. Then run it on the source of TinyC, that gives us a C compiler capable of compiling Linux (and hopefully glibc and GCC) and we're in business. If we don't even have an assembler, then we'd first bootstrap one of those, it's easier than a C compiler. – Steve Jessop Feb 24 '12 at 12:07
  • @SteveJessop: If TCC can't compile GCC I'd try PCC as a intermediary. – datenwolf Feb 24 '12 at 12:48
  • 3
    @datenwolf: I'm probably not the only geek who wrote his own VM and a compiler for it. It's not _that_ hard. – MSalters Feb 24 '12 at 13:19
  • @user1010005: It's very unlikely we lost all C compiler binaries. I've still got several Linux Distrubtions on DVDs at home, the Visual Studio CDs, etc. All it takes is only one working copy of a C compiler to bootstrap the rest. – datenwolf Feb 24 '12 at 15:14
  • @user1010005: What if all power plants on the world explode at once? – Sebastian Mach Feb 24 '12 at 16:05
  • 6
    Bootstrapping is cool. How do you make a flat surface without a flat surface? How do you get machine tools in a world with no machine tools? How do you make 32nm chips in a world without working chips? Make vehicles without vehicles? All these things are built on previous generations. – phkahler Feb 24 '12 at 17:17
  • 1
    @phresnel: We would have more grave problems than recompiling GCC if that happened, wouldn't we? – danielkza Feb 25 '12 at 21:50
  • 2
    @larsmans: As your Wikipedia link (and Stroustrop) describe, **the first C++ compiler was written in C++**, not C. At that time, there was already a tool that would convert C++ source to C source, making this chicken/egg scenario possible. – Drew Dormann Feb 28 '12 at 17:45
  • 1
    @DrewDormann: but then what do you call the tool that compiles C++ to C source? And what was that originally written in? (You could say C with Classes, but I consider that cheating :) – Fred Foo Feb 28 '12 at 19:02
  • 1
    @larsmans: From the horse's mouth: He wrote a **preprocessor** in C that converted C With Classes source to C source. Then he wrote a **compiler** in C++ that apparently didn't choke that preprocessor. http://www2.research.att.com/~bs/bs_faq.html#bootstrapping – Drew Dormann Feb 28 '12 at 21:14
  • @DrewDormann: I think this is a matter of definition (compiler/translator/preprocessor), but I amended the answer anyway. Thanks for the link. – Fred Foo Feb 28 '12 at 21:43
  • @larsmans: I completely agree. I'm no master of semantics, so I just cautiously refer to Stroustrup's own account. I'll take his word for it. – Drew Dormann Feb 29 '12 at 20:44
  • Why is stage 3 needed? How can stage 2 and 3 possibly differ (except for `__TIME__` of course, which should not matter)? – Ciro Santilli OurBigBook.com May 15 '15 at 09:53
  • 1
    @CiroSantilli新疆改造中心六四事件法轮功: Imagine the case of a backdoor; your stage 1 compiler (GCC, not the one doing the bootstrapping) recognizes a comment of "//KeywordChecks" which appears right before a series of if-elseif statements, so it adds in a chunk of instructions to the newly built binary that recognizes "//BADCODE" as a keyword and if found inserts a chunk of instructions that does devious stuff. Now your stage 2 compiler recognizes "//BADCODE" which the stage 1 compiler doesn't. So upon recompiling itself in step 3, if "//BADCODE" is a comment, then stage 2 and 3 will differ. – Ryan Jun 06 '18 at 00:21
  • Note that current versions of GCC are actually written in C++, so if all you start with is a C compiler, then you'll first need to build some other C++ compiler which is written in C. (An old version of G++ might work.) – Nate Eldredge Jan 13 '21 at 00:00
  • @Ryan and if "//BADCODE" isn't in the compiler's source? – nog642 Oct 13 '21 at 20:36
  • @nog642 pick something that does appear in the compiler's source. In case it wasn't obvious, that was just used as an example of a placeholder string to search for and replace. – Ryan Jul 07 '23 at 19:59
23

If you want to replicate the bootstrap process of GCC in a modern environment (x86 Linux), you can use the tools developed by the bootstrappable project:

  • We can start with hex0 assembler (on x86 it's 357 byte binary) which does roughly what the following two commands do

    sed 's/[;#].*$//g' hex0_x86.hex0 | xxd -r -p > hex0
    chmod +x hex0
    

    I.e. it translates ASCII equivalent of binary program into binary code, but it is written in hex0 itself.

    Basically, hex0 has equivalent source code that is in one to one correspondence to its binary code.

  • hex0 can be used to build a slighly more powerful hex1 assembler that supports a few more features (one character labels and calculates offsets). hex1 is written in hex0 assembly.

  • hex1 can be used to build hex2 (even more advanced assembler that supports multi character labels).

  • hex2 then can be used to build a macro assembler (where program using macros instead of hex opcodes).

  • You can then use thismacro assembler to build cc_x86 which is a "C compiler" written in assembly. cc_x86 only supports a small subset of C but that's an impresive start.

  • You can use cc_x86 to build M2-Planet (Macro Platform Neutral Transpiler) which is a C compiler written in C. M2-Planet is self hosting and can build itself.

  • You can then use M2-Planet to build GNU Mes which is a small scheme interpreter.

  • mes can be used to run mescc which is a C compiler written in scheme and lives in the same repository as mes.

  • mescc can be used to rebuild mes and also build mes C library.

  • Then mescc can be used to build a slighly patched Tiny C compiler.

  • Then you can use it to build newer version of TCC 0.9.27.

  • GCC 4.0.4 and musl C library can be built with TCC 0.9.27.

  • Then you can build newer GCC using older GCC. E.g. GCC 4.0.4 -> GCC 4.7.4 -> modern GCC.

TL;DR:

hex0 -> hex1 -> hex2 -> M0 -> M2-Planet -> Mes -> Mescc -> TCC -> GCC.

  • Your forgot you need to recompile modern GCC to get all the optimisation in C++ (gcc is written in C++ after all nowadays) of that modern GCC and not just 4.7.4. – Валерий Заподовников Jun 11 '22 at 09:12
  • Well, this is just a short summary. When you build modern GCC you can use its build system to recompile it automatically. And there are lots of missing steps here that don't deal with compilers, you need to build shell, binutils and many others. E.g. you can take a look at what live-bootstrap does https://github.com/fosslinux/live-bootstrap/blob/master/parts.rst though for now it only goes to g++ 4.7.4. – Andrius Štikonas Jun 11 '22 at 14:29
  • 1. Do you mean x86_64 or x86 32-bit? 2. What does it matter that M2 Planet is self-hosting? 3. Why is it necessary to rebuild mes and build the mes C library? 4. If GNU Mes is a C compiler already, why do we need to build TCC? i.e. why can't GCC be built with mes? 5. Why can't the patched-TCC be used to build GCC and musl, but is rather used to build an unpatched TCC? – einpoklum Jul 20 '23 at 07:59
  • It is 32-bit version. 64-bit support is close but not done. 2. Well, in this bootstrapping process it's the first self-hosting program, i.e. you can mostly forget about assembly by then and focus on more and more advanced compilers. 3. Because mes built with mescc is a bit more capable than mes built with M2-Planet (and is also faster). 4. Because mescc can't build more complicated software, if I remember, it doesn't even have floats. Also mescc is very slow, it takes 10 minutes to build tcc and GCC is muc larger. 5. This is a patched TCC 0.9.26 which is much older. TCC 0.9.27 is more capable. – Andrius Štikonas Jul 20 '23 at 14:31