3

I would like to create my own programming language targeting the JVM. I am unsure how to do this. Must I create my own compiler? Do all programming languages have unique compilers, or are there existing ones that can be adapted?

I have found some information about targeting the .NET CLI.

I've also found the Dragon Book on compiler design.

NickAldwin
Rohan Sethi
  • This question is too broad. But in general you will have to first write a parser to parse your language and perhaps compile it into some intermediate representation (like an AST for example, or some intermediate language). After that you will have to translate your intermediate representation into bytecode. – Vivin Paliath Sep 23 '14 at 15:56
  • Lots of languages target the JVM, other than Java: Scala, Clojure, Groovy, Jython...you should take a look at the source code for their compilers. – Andrew Mao Sep 23 '14 at 15:59
  • The LLVM tutorial is excellent; it is quite short and very well written. I know that you said that you wanted to target the JVM, but nearly everything that this tutorial covers will help you understand the parts required (http://llvm.org/docs/tutorial/) – Chris K Sep 23 '14 at 16:00
  • I also found this on Stack Overflow: http://stackoverflow.com/questions/1669/learning-to-write-a-compiler – Chris K Sep 23 '14 at 16:15
  • @ChrisK I referred to that link before. But frankly, I wasn't too sure it was the place to get started. – Rohan Sethi Sep 23 '14 at 16:19
  • @RohanSethi no problem. Here is another for you: this is the best parser library that I have come across that runs on the JVM. https://github.com/sirthias/parboiled. I would start there, and use it to create a recogniser for a simple language that recognises '1+2' only. Then add an interpreter, and then generate either Java source code, to be compiled separately via javac, or JVM bytecode. Keeping the language simple and going depth first will give you faster feedback and knowledge of each of the layers before you grow the language. Good luck. – Chris K Sep 23 '14 at 16:20
  • @RohanSethi learning to write compilers is quite a dark art; I was lucky enough to learn it at university. You will hit two problems: 1) the theory is very dense and difficult to penetrate, and 2) your first twenty attempts will be a spaghetti mess. :) – Chris K Sep 23 '14 at 16:21
  • @ChrisK I do have the subject next semester at university, but that's too long to wait. I really appreciate your efforts. Thanks. – Rohan Sethi Sep 23 '14 at 16:25
  • @Rohan: You'll waste a lot of time trying to build a compiler without the class. Suggest you hold your breath and take that class. You're not likely to die of old age soon; you have some time. – Ira Baxter Sep 23 '14 at 17:57
  • Will consider your suggestion @IraBaxter. Thanks :) – Rohan Sethi Sep 23 '14 at 18:03

3 Answers

7

Yes, every language has its own compiler. There are a few kinds of compiler you can write, each more complicated than the last and building on it:

  1. recogniser: only answers whether the input source is valid syntax,
  2. parser: creates an in-memory representation of the input source (called an AST, abstract syntax tree),
  3. compiler: generates a translated form of the input,
  4. optimising compiler: as 3, but optimises the AST before generating the output.
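
To make the first two stages concrete, here is a minimal hand-written sketch for a toy grammar (a number followed by any number of `+ number`). The names `TinyFrontEnd`, `Num` and `Add` are made up for illustration and do not come from parboiled or any other library; a real project would more likely generate this layer with a parser tool.

```java
/** Stages 1 and 2 for a toy grammar: expr := number ('+' number)* */
final class TinyFrontEnd {

    /** AST node: either a number literal or an addition of two sub-trees. */
    interface Node {}

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " + " + right + ")"; }
    }

    private final String src;
    private int pos = 0;

    private TinyFrontEnd(String src) { this.src = src; }

    /** Stage 1, recogniser: answers only "is this valid syntax?" by discarding the tree. */
    static boolean recognise(String source) {
        try {
            parse(source);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Stage 2, parser: builds the in-memory AST or throws on a syntax error. */
    static Node parse(String source) {
        TinyFrontEnd p = new TinyFrontEnd(source);
        Node tree = p.expr();
        if (p.pos != p.src.length()) throw p.error("unexpected input");
        return tree;
    }

    private Node expr() {
        Node left = number();
        skipSpaces();
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++;                              // consume '+'
            left = new Add(left, number());     // left-associative
            skipSpaces();
        }
        return left;
    }

    private Node number() {
        skipSpaces();
        int start = pos;
        while (pos < src.length() && src.charAt(pos) >= '0' && src.charAt(pos) <= '9') pos++;
        if (start == pos) throw error("expected a number");
        // NumberFormatException (too many digits) is also an IllegalArgumentException
        return new Num(Integer.parseInt(src.substring(start, pos)));
    }

    private void skipSpaces() {
        while (pos < src.length() && src.charAt(pos) == ' ') pos++;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at position " + pos);
    }

    public static void main(String[] args) {
        System.out.println(recognise("1+2"));     // prints true
        System.out.println(recognise("1+"));      // prints false
        System.out.println(parse("1 + 2 + 3"));   // prints ((1 + 2) + 3)
    }
}
```

In this sketch the recogniser is simply the parser with its result thrown away, which is one way to see how each stage builds on the previous one.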

All of these compiler forms usually reuse tools that are specially designed to help with different stages of compilation. Briefly, these are:

Parsing: I would recommend parboiled for Java. Older tools tended to be variants of lex and yacc, two Unix tools for the lexical and grammar stages of parsing. ANTLR and JavaCC are two examples that run on the JVM; however, parboiled is just awesome.

AST: I do not know of any tool here. One can reuse the model from another JVM compiler such as javac; however, I would personally create this myself.

Output Generation: A quick approach is to generate Java source code, which has some limitations but is overall an excellent way of testing the water. When/if you decide to move on to generating JVM bytecode, a collection of helper libraries can be found here. However, there is a lot to learn about the JVM before attempting that route; the JVM spec/book by Oracle is a mandatory read.
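
If you take the "generate Java source" route, the generated text still has to be compiled somehow. One option, sketched below, is the standard `javax.tools` API that ships with the JDK, which lets you call javac in-process and load the result. The class name `GeneratedSourceRunner`, the method `compileAndLoad` and the generated class `Generated` are made up for this example, and it assumes you run on a full JDK (on a bare JRE `ToolProvider.getSystemJavaCompiler()` returns null).

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class GeneratedSourceRunner {

    /** Compiles one generated top-level class with the in-process javac and loads it. */
    static Class<?> compileAndLoad(String className, String javaSource) throws Exception {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system Java compiler found; a JDK is required");
        }

        // Write the generated source to <tempdir>/<className>.java
        Path dir = Files.createTempDirectory("generated");
        Path file = dir.resolve(className + ".java");
        Files.write(file, javaSource.getBytes(StandardCharsets.UTF_8));

        // Passing null for the three streams means System.in / System.out / System.err.
        int exitCode = javac.run(null, null, null, "-d", dir.toString(), file.toString());
        if (exitCode != 0) {
            throw new IllegalStateException("javac rejected the generated source");
        }

        // The loader is deliberately left open for as long as the loaded class is in use.
        URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() });
        return loader.loadClass(className);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for what a code generator might emit for the input "1+2".
        String source =
              "public class Generated {\n"
            + "    public static int run() { return 1 + 2; }\n"
            + "}\n";
        Class<?> generated = compileAndLoad("Generated", source);
        Object result = generated.getMethod("run").invoke(null);
        System.out.println(result);
    }
}
```

A nice side effect of this route is that javac's own error messages point at the generated text, which you can open and read.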

For general knowledge, the LLVM tutorial is excellent; it is quite short and very well written. I know you said that you want to target the JVM, but nearly everything this tutorial covers will help you understand the parts required.

I would recommend following the tutorial and rewriting it in Java. Its steps are very logical. Essentially, one would write a recogniser for a very simple language, such as one that accepts '1+2' only, then write an interpreter for that language. That would be a very reasonable stopping point: many languages are interpreted, and Java started off its life like this too. Optionally, one can then move on to emitting a target output, say Java source code at first. The code for this would be fairly short, and it will give you quicker feedback than trying to write any single layer in full first. Writing one layer in full offers many opportunities to consume your coding hours.
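
As a rough idea of how short the interpreter and the first code emitter can be for the '1+2' language, here is a sketch. The AST is built by hand rather than parsed, and the names `TinyInterpreter`, `Num`, `Add`, `eval` and `toJava` are all invented for this example.

```java
/** Depth-first slice of a toy language: interpret the AST, or translate it to Java. */
final class TinyInterpreter {

    interface Node {
        int eval();        // interpreter: compute the value directly
        String toJava();   // emitter: translate to a Java expression
    }

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
        @Override public int eval() { return value; }
        @Override public String toJava() { return Integer.toString(value); }
    }

    static final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
        @Override public int eval() { return left.eval() + right.eval(); }
        @Override public String toJava() {
            return "(" + left.toJava() + " + " + right.toJava() + ")";
        }
    }

    public static void main(String[] args) {
        Node program = new Add(new Num(1), new Num(2));   // AST for "1+2", built by hand
        System.out.println(program.eval());               // prints 3
        System.out.println(program.toJava());             // prints (1 + 2)
    }
}
```

Putting `eval` and `toJava` directly on the nodes keeps the sketch small; once the language grows, most people move these into separate passes over the tree.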

Community
Chris K
4

Chris K. gave quite a nice answer; however, on one point I (as someone who has already written a working compiler for a non-trivial JVM language) must strongly disagree:

The code generator should indeed generate just Java (or, if you like, Scala, Ceylon, Kotlin, Clojure, ... whatever you like) code in the beginning, for the following reasons:

  • The other tasks (lexing, parsing, maintaining the compiler state, a.k.a. the symbol table, semantic analysis, etc.) are already demanding enough. Learning yet another library on top of that is overdoing it and will delay your first results substantially.
  • Once you have everything, including code generation, and compile your first program, you'll find that your compiler is literally full of bugs. It is much easier to see those bugs manifest themselves in nonsensical or erroneous Java code than in erroneous class files. (Would you rather get a cryptic message from the bytecode verifier or look at the generated code in text form?)
  • Code generation should be a separate module anyway; nothing in the rest of the compiler depends (or should depend) on it. So it is comparatively easy to replace once you can be sure that your compiler can indeed make sense of its input (proof of which is compilable Java code that passes some tests, etc.). As long as the class-file generation is not 100% foolproof, it should remain an option whether code is generated as Java or as binary. This way, you can compile test programs to both Java and bytecode and run tests on both outcomes, which makes error analysis considerably easier when a generated class file suddenly fails.
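
The "separate module" point can be sketched as a single interface that the rest of the compiler talks to, with the output format chosen by an option. Everything here is hypothetical: `Backend`, `JavaSourceBackend` and `StackListingBackend` are invented names, and the second backend only prints a made-up textual stack-machine listing (`push`/`add`) as a stand-in for a real class-file writer, which would need a bytecode library or a lot more code.

```java
/** Code generation behind one interface, so backends can be swapped or run side by side. */
final class PluggableBackends {

    interface Node {}

    static final class Num implements Node {
        final int value;
        Num(int value) { this.value = value; }
    }

    static final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
    }

    /** The only contract the rest of the compiler sees. */
    interface Backend {
        String generate(Node ast);
    }

    /** Emits a Java expression as text. */
    static final class JavaSourceBackend implements Backend {
        @Override public String generate(Node ast) {
            if (ast instanceof Num) return Integer.toString(((Num) ast).value);
            Add add = (Add) ast;
            return "(" + generate(add.left) + " + " + generate(add.right) + ")";
        }
    }

    /** Stand-in for a binary backend: a made-up stack-machine listing, not real bytecode. */
    static final class StackListingBackend implements Backend {
        @Override public String generate(Node ast) {
            if (ast instanceof Num) return "push " + ((Num) ast).value + "\n";
            Add add = (Add) ast;
            return generate(add.left) + generate(add.right) + "add\n";
        }
    }

    /** The output format is a compiler option, so tests can exercise both backends. */
    static String compile(Node ast, boolean emitJava) {
        Backend backend = emitJava ? new JavaSourceBackend() : new StackListingBackend();
        return backend.generate(ast);
    }

    public static void main(String[] args) {
        Node program = new Add(new Num(1), new Num(2));
        System.out.println(compile(program, true));    // prints (1 + 2)
        System.out.print(compile(program, false));     // prints push 1, push 2, add on separate lines
    }
}
```

With this shape, a test suite can feed the same AST to both backends and compare the behaviour of the two outputs.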

I personally would not even begin to generate class files until your compiler, written in your own language, can compile itself into Java, and the resulting program can compile the compiler source to exactly the same Java code.

Ingo
  • @rohan-sethi another important aspect of creating a programming language is that you should first **design** your language. Think about it, try to write a few programs on paper, and "compile" them to Java on paper manually. Once you're happy with the language design, another important and useful step is writing a bootstrapping compiler (http://en.wikipedia.org/wiki/Bootstrapping_(compilers)) – xmojmr Oct 23 '14 at 13:04
3

The easiest way is to use MPS (http://www.jetbrains.com/mps/); you will get IDE support as a bonus.

  • IDE support is a big win with MPS, and the tutorial videos are also quite informative. It also comes with a Java grammar, which one can modify to jump-start one's own language. It is well worth a look. – Chris K Sep 24 '14 at 07:45