I am researching CoffeeScript on the website http://coffeescript.org/, and it has the text
The CoffeeScript compiler is itself written in CoffeeScript
How can a compiler compile itself, or what does this statement mean?
The first edition of a compiler can't be machine-generated from source written in the very language it compiles, so your confusion is understandable. A later version of the compiler with more language features (with its source rewritten in the first version of the new language) could be built by the first compiler. That version could then compile the next compiler, and so on. Here's an example:
Note: I'm not sure exactly how CoffeeScript versions are numbered; that was just an example.
This process is usually called bootstrapping. Another example of a bootstrapping compiler is rustc, the compiler for the Rust language.
In the paper Reflections on Trusting Trust, Ken Thompson, one of the originators of Unix, writes a fascinating (and easily readable) overview of how the C compiler compiles itself. Similar concepts can be applied to CoffeeScript or any other language.
The idea of a compiler that compiles its own code is vaguely similar to a quine: source code that, when executed, produces as output the original source code. Here is one example of a CoffeeScript quine. Thompson gave this example of a C quine:
char s[] = {
'\t',
'0',
'\n',
'}',
';',
'\n',
'\n',
'/',
'*',
'\n',
… 213 lines omitted …
0
};
/*
* The string s is a representation of the body
* of this program from '0'
* to the end.
*/
main()
{
int i;
printf("char\ts[] = {\n");
for(i = 0; s[i]; i++)
printf("\t%d,\n", s[i]);
printf("%s", s);
}
Next, you might wonder how the compiler is taught that an escape sequence like '\n' represents ASCII code 10. The answer is that somewhere in the C compiler, there is a routine that interprets character literals, containing some conditions like this to recognize backslash sequences:
…
c = next();
if (c != '\\') return c; /* A normal character */
c = next();
if (c == '\\') return '\\'; /* Two backslashes in the code means one backslash */
if (c == 'r') return '\r'; /* '\r' is a carriage return */
…
So, we can add one condition to the code above…
if (c == 'n') return 10; /* '\n' is a newline */
… to produce a compiler that knows that '\n' represents ASCII 10. Interestingly, that compiler, and all subsequent compilers compiled by it, "know" that mapping, so in the next generation of the source code, you can change that last line into
if (c == 'n') return '\n';
… and it will do the right thing! The 10 comes from the compiler, and no longer needs to be explicitly defined in the compiler's source code.1
That is one example of a C language feature that was implemented in C code. Now, repeat that process for every single language feature, and you have a "self-hosting" compiler: a C compiler that is written in C.
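To make the two generations concrete, here is a minimal sketch of such a routine. The function names and the simplified signature (decoding the body of a single character literal passed as a string) are my own assumptions for illustration; they are not taken from Thompson's paper or from any real compiler.

```c
/* Hypothetical sketch: decode the body of one character literal,
 * e.g. the two characters backslash and 'n' found between the quotes. */

/* Generation 1: the compiler that builds this source does not yet
 * understand '\n', so the numeric code 10 is spelled out by hand. */
int decode_literal_gen1(const char *body)
{
    int c = body[0];
    if (c != '\\') return c;      /* a normal character */
    c = body[1];
    if (c == '\\') return '\\';   /* two backslashes mean one backslash */
    if (c == 'n') return 10;      /* the fact is taught explicitly */
    return c;                     /* assumption: unknown escapes pass through */
}

/* Generation 2: built by a generation-1 compiler, which already maps
 * '\n' to 10, so the source can use the escape it is defining. */
int decode_literal_gen2(const char *body)
{
    int c = body[0];
    if (c != '\\') return c;
    c = body[1];
    if (c == '\\') return '\\';
    if (c == 'n') return '\n';    /* the 10 now comes from the compiler */
    return c;
}
```

On an ASCII system both functions return 10 for the input backslash followed by n; the only difference is where the number 10 lives: in the source (generation 1) or in the compiler that built it (generation 2).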
1 The plot twist described in the paper is that since the compiler can be "taught" facts like this, it can also be mis-taught to generate trojaned executables in a way that is difficult to detect, and such an act of sabotage can persist in all compilers produced by the tainted compiler.
You have already gotten a very good answer, but I want to offer you a different perspective that will hopefully be enlightening. Let's first establish two facts that we can both agree on:
I'm sure you can agree that both #1 and #2 are true. Now, look at the two statements. Do you see now that it is completely normal for the CoffeeScript compiler to be able to compile the CoffeeScript compiler?
The compiler doesn't care what it compiles. As long as it's a program written in CoffeeScript, it can compile it. And the CoffeeScript compiler itself just happens to be such a program. The CoffeeScript compiler doesn't care that it's the CoffeeScript compiler itself it is compiling. All it sees is some CoffeeScript code. Period.
How can a compiler compile itself, or what does this statement mean?
Yes, that's exactly what that statement means, and I hope you can see now how that statement is true.
How can a compiler compile itself, or what does this statement mean?
It means exactly that. First of all, some things to consider. There are four objects we need to look at:
Now, it should be obvious that you can use the generated assembly (the executable) of the CoffeeScript compiler to compile any arbitrary CoffeeScript program, and generate the assembly for that program.
Now, the CoffeeScript compiler itself is just an arbitrary CoffeeScript program, and thus, it can be compiled by the CoffeeScript compiler.
It seems that your confusion stems from the fact that when you create your own new language, you don't yet have a compiler you can use to compile your compiler. This surely looks like a chicken-and-egg problem, right?
Enter the process called bootstrapping.
Now you need to add new features. Say you have only implemented while-loops, but also want for-loops. This isn't a problem, since you can rewrite any for-loop as a while-loop. This means you can use only while-loops in the source code of your compiler, since the assembly you have at hand can compile only those. But you can write functions inside your compiler that parse and compile for-loops. Then you use the assembly you already have to compile the new compiler version. And now you have an assembly of a compiler that can also parse and compile for-loops! You can now go back to the source file of your compiler and rewrite any while-loops you don't want into for-loops.
Rinse and repeat until all language features that are desired can be compiled with the compiler.
while and for were obviously only examples, but this works for any new language feature you want. And then you are in the situation CoffeeScript is in now: the compiler compiles itself.
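The claim that any for-loop can be rewritten as a while-loop can be sketched as a simple textual rewrite. The helper below is hypothetical (the name `desugar_for` and its parameters are mine), and it deliberately ignores details a real compiler must handle, such as `continue` inside the body and the scope of variables declared in the initializer.

```c
#include <stdio.h>

/* Hypothetical helper: writes into `out` a while-loop equivalent to
 *     for (init; cond; step) { body }
 * Returns what snprintf returns: the length the full text needs. */
int desugar_for(char *out, size_t size,
                const char *init, const char *cond,
                const char *step, const char *body)
{
    /* The initializer runs once, the condition guards the loop,
     * and the step is appended to the end of the body. */
    return snprintf(out, size, "%s; while (%s) { %s %s; }",
                    init, cond, body, step);
}
```

For example, passing `"i = 0"`, `"i < 3"`, `"i++"` and `"sum += i;"` yields the text `i = 0; while (i < 3) { sum += i; i++; }`, which is something a while-only compiler can already handle.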
There is much literature out there. Reflections on Trusting Trust is a classic everyone interested in that topic should read at least once.
Here the term compiler glosses over the fact that there are two files involved. One is an executable which takes as input files written in CoffeeScript and produces as its output another executable, a linkable object file, or a shared library (in CoffeeScript's case, JavaScript). The other is a CoffeeScript source file which just happens to describe the procedure for compiling CoffeeScript.
You apply the first file to the second, producing a third which is capable of performing the same act of compilation as the first (possibly more, if the second file defines features not implemented by the first), and so may replace the first if you so desire.
Since the Ruby version of the CoffeeScript compiler already existed, it was used to create the CoffeeScript version of the CoffeeScript compiler.
This is known as a self-hosting compiler.
It's extremely common, and usually results from an author's desire to use their own language to maintain that language's growth.
It's not a matter of compilers here, but a matter of expressiveness of the language, since a compiler is just a program written in some language.
When we say that "a language is written/implemented" we actually mean that a compiler or interpreter for that language is implemented. There are programming languages in which you can write programs that implement the language (are compilers/interpreters for the same language). These languages are called universal languages.
In order to be able to understand this, think about a metal lathe. It is a tool used to shape metal. It is possible, using just that tool, to create another, identical tool, by creating its parts. Thus, that tool is a universal machine. Of course, the first one was created using other means (other tools), and was probably of lower quality. But the first one was used to build new ones with higher precision.
A 3D printer is almost a universal machine. You can print the whole 3D printer using a 3D printer (you can't build the tip that melts the plastic).
The (n+1)th version of the compiler is written in X.
Thus it can be compiled by the nth version of the compiler (also written in X).
But the first version of the compiler written in X must be compiled by a compiler for X that is written in a language other than X. This step is called bootstrapping the compiler.
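The chain of versions can be modeled in a few lines. This is only a toy model under my own assumptions: a compiler executable is reduced to the set of language features it accepts, and a compiler source to the features it is written with and the features it implements. All names (`Binary`, `Source`, `build`, the `FEAT_` flags) are hypothetical.

```c
#include <stdbool.h>

/* Language features as bit flags (illustrative names only). */
enum { FEAT_WHILE = 1, FEAT_FOR = 2 };

typedef struct {
    unsigned accepts;     /* features this executable can compile */
} Binary;

typedef struct {
    unsigned uses;        /* features the compiler's own source is written with */
    unsigned implements;  /* features the compiler described by the source handles */
} Source;

/* Apply compiler executable `cc` to compiler source `src`.
 * Fails if the source uses a feature `cc` cannot compile yet;
 * otherwise the result accepts whatever the source implements. */
bool build(Binary cc, Source src, Binary *out)
{
    if ((src.uses & ~cc.accepts) != 0)
        return false;
    out->accepts = src.implements;
    return true;
}
```

With a stage-0 executable that accepts only `FEAT_WHILE` (the one built by some other compiler), a source that uses only while-loops but implements both features builds successfully, and the resulting executable can then build a later source that is itself written with for-loops. Feeding that later source directly to stage 0 fails, which is exactly why the versions have to be built in order.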
While other answers cover all the main points, I feel it would be remiss not to include what may be the most impressive example known of a compiler which was bootstrapped from its own source code.
Decades ago, a man named Doug McIlroy wanted to build a compiler for a new language called TMG. Using paper and pen, he wrote out source code for a simple TMG compiler... in the TMG language itself.
Now, if only he had a TMG interpreter, he could use it to run his TMG compiler on its own source code, and then he would have a runnable, machine-language version of it. But... he did have a TMG interpreter already! It was a slow one, but since the input was small, it would be fast enough.
Doug ran the source code on that paper on the TMG interpreter behind his eye sockets, feeding it the very same source as its input file. As the compiler worked, he could see the tokens being read from the input file, the call stack growing and shrinking as it entered and exited subprocedures, the symbol table growing... and when the compiler started emitting assembly language statements to its "output file", Doug picked up his pen and wrote them down on another piece of paper.
After the compiler finished execution and exited successfully, Doug brought the resulting hand-written assembly listings to a computer terminal, typed them in, and his assembler converted them into a working compiler binary.
So this is another practical (???) way to "use a compiler to compile itself": Have a working language implementation in hardware, even if the "hardware" is wet and squishy and powered by peanut butter sandwiches!
Compilers take a high-level specification and turn it into a low-level implementation, such as can be executed on hardware. Therefore there is no relationship between the format of the specification and the actual execution besides the semantics of the language being targeted.
Cross-compilers run on one system and generate code for another system; cross-language compilers compile one language specification into another language specification.
Basically, compiling is just translation, usually from a higher-level language to a lower-level one, but there are many variants.
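As a rough illustration of "compiling is translation", here is a toy translator from a tiny expression language to text for an imaginary stack machine. Everything in it is an assumption made for the sketch: the input allows only single digits, `+` and `*` (no spaces or parentheses), and the `PUSH`, `ADD` and `MUL` instructions are invented.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *src;  /* remaining input */
    char *out;        /* output buffer, always NUL-terminated */
    size_t cap;       /* capacity of the output buffer */
} Translator;

/* Append one instruction, silently dropping it if it does not fit. */
static void emit(Translator *t, const char *text)
{
    if (strlen(t->out) + strlen(text) < t->cap)
        strcat(t->out, text);
}

/* factor := digit */
static void factor(Translator *t)
{
    if (isdigit((unsigned char)*t->src)) {
        char line[16];
        snprintf(line, sizeof line, "PUSH %c\n", *t->src++);
        emit(t, line);
    }
}

/* term := factor ('*' factor)* */
static void term(Translator *t)
{
    factor(t);
    while (*t->src == '*') {
        t->src++;
        factor(t);
        emit(t, "MUL\n");
    }
}

/* expr := term ('+' term)* */
static void expr(Translator *t)
{
    term(t);
    while (*t->src == '+') {
        t->src++;
        term(t);
        emit(t, "ADD\n");
    }
}

/* Translate `src` into stack-machine text stored in `out`. */
void translate(const char *src, char *out, size_t cap)
{
    Translator t = { src, out, cap };
    if (cap == 0)
        return;
    out[0] = '\0';
    expr(&t);
}
```

Translating `1+2*3` produces `PUSH 1`, `PUSH 2`, `PUSH 3`, `MUL`, `ADD`, one instruction per line. The output has nothing in common with the input format except its meaning, which is the point made above.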
Bootstrapping compilers are the most confusing, of course, because they compile the language they are written in. Don't forget the initial step in bootstrapping, which requires at least a minimal existing version that is executable. Many bootstrapped compilers implement a minimal subset of the programming language first and add more complex language features later, as long as each new feature can be expressed using the previous features. If that were not the case, that part of the "compiler" would have to be developed in another language beforehand.