104

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either

  • writing an initial compiler in a different language.
  • hand-coding an initial compiler in Assembly, which seems like a special case of the first

To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?

Mark Harrison
  • 297,451
  • 125
  • 333
  • 465
pbh101
  • 10,203
  • 9
  • 32
  • 31
  • 2
    Thanks for the info, everyone. When explained with the idea of initially writing a limited compiler, then building up on top of that, then the idea of bootstrapping makes more sense. I'm taking a Compilers class this semester, a decision largely influenced by [Steve Yegge's post on how important a class in Compilers](http://steve-yegge.blogspot.com/2007/06/rich-programmer-food.html) is, and I just bought a copy of the Dragon book from the Amazon link that got so downmodded on SO earlier. – pbh101 Aug 17 '08 at 22:06
  • 2
    See also similar question: [Implementing a compiler in itself](http://stackoverflow.com/questions/193560/implementing-a-compiler-in-itself) – Urban Vagabond Nov 02 '13 at 07:40
  • 1
    I'm not very experienced with such things, but I would assume that the _initial_ compiler would have to be written in another language. I'm fairly certain that "bootstrapping", in reference to compilers, simply refers to writing _a_ compiler for a language in the language it's meant to compile, not writing _the first_ compiler for the language in the language it's meant to compile. – jdd Aug 17 '08 at 06:57

11 Answers11

114

Is there a way to actually write a compiler in its own language?

You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.

If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of the Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.

There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.

MD XF
  • 7,860
  • 7
  • 40
  • 71
Derek Park
  • 45,824
  • 15
  • 58
  • 76
23

The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):

  • Write a compiler C1 for language X in language Y
  • Use the compiler C1 to write compiler C2 for language X in language X
  • Now C2 is a fully self hosting environment.
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Mark Harrison
  • 297,451
  • 125
  • 333
  • 465
8

The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.

This is the definition of bootstrapping:

the process of a simple system activating a more complicated system that serves the same purpose.

EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.

Eric Haskins
  • 8,505
  • 12
  • 38
  • 47
7

A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.

He starts off with:

What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this ease, I will use a specific example from the C compiler.

and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.

The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Mark Harrison
  • 297,451
  • 125
  • 333
  • 465
5

Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
MauganRa
  • 494
  • 6
  • 8
5

Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
none
  • 5,701
  • 28
  • 32
4

Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.

How else would it work? I don't think it's even conceptually possible to do otherwise.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Adam Lassek
  • 35,156
  • 14
  • 91
  • 107
  • 6
    The first Lisp compiler, at least, was bootstrapped using an existing Lisp *interpreter*. So not another language semantically, but another language implementation. – Ken Mar 06 '10 at 21:52
4

As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.

You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
luser droog
  • 18,988
  • 3
  • 53
  • 105
  • Whatever happened to parts 2 and 3? ... How did I not notice that @Wing posted the same thing 3 years before me? I'm a dunce. At least I linked the paper (with help). – luser droog Mar 30 '13 at 06:33
3

It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should Lisp could have done it.

Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Wing
  • 5,988
  • 3
  • 21
  • 12
3

Another alternative is to create a bytecode machine for your language (or use an existing one if it's features aren't very unusual) and write a compiler to bytecode, either in the bytecode, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compile the XML to bytecode using XSLT (or another pattern matching language and tree-based representation). It doesn't remove the dependency on another language, but could mean that more of the bootstrapping work ends up in the final system.

Pete Kirkham
  • 48,893
  • 5
  • 92
  • 171
1

Some bootstrapped compilers or systems keep both the source form and the object form in their repository:

  • ocaml is a language which has both a bytecode interpreter (i.e. a compiler to Ocaml bytecode) and a native compiler (to x86-64 or ARM, etc... assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it is first using its bytecode (of a previous version of the compiler) to compile itself. Later the freshly compiled bytecode is able to compile the native compiler. So Ocaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.

  • The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.

  • MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.

  • J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.

  • Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547