I already have an interpreter for my language. It is implemented with:

  • parser -> scala parser combinators;
  • AST -> scala case classes;
  • evaluator -> scala pattern matching.

Now I want to compile the AST to native code and, hopefully, to Java bytecode. I am considering two main options to accomplish at least one of these two tasks:

  • generate LLVM IR code;
  • generate C code and/or Java code;

Note: GCJ and SLEM seem to be unusable (in my tests, GCJ works only with simple code).

2 Answers

Short Answer

I'd go with Java Bytecode.

Long Answer

The thing is, the higher-level the language you compile to,

  1. The slower and more cumbersome the compilation process is
  2. The more flexibility you get

For instance, if you compile to C, you can then get a lot of possible backends for C compilers - you can generate Java Bytecode, LLVM IR, asm for many architectures, etc., but you basically compile twice. If you choose LLVM IR you're already halfway to compiling to asm (parsing LLVM IR is far faster than parsing a language such as C), but you'll have a very hard time getting Java Bytecode from that. Both intermediate languages can compile to native, though.
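To make the "compile to C" route concrete, here is a minimal sketch in Scala. The `Expr`, `Num`, `Add`, and `Mul` case classes and the `CBackend` object are hypothetical stand-ins, since the question does not show the real AST. The emitter walks the tree with the same kind of pattern matching the evaluator uses, and the resulting text still has to go through a C compiler, which is the second compilation step mentioned above.

```scala
// Hypothetical toy AST; the asker's real case classes are not shown in the question.
sealed trait Expr
case class Num(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

object CBackend {
  // Interpreter-style evaluation via pattern matching, as described in the question.
  def eval(e: Expr): Int = e match {
    case Num(v)    => v
    case Add(l, r) => eval(l) + eval(r)
    case Mul(l, r) => eval(l) * eval(r)
  }

  // Translate an expression into a fully parenthesized C expression.
  def emitExpr(e: Expr): String = e match {
    case Num(v)    => v.toString
    case Add(l, r) => "(" + emitExpr(l) + " + " + emitExpr(r) + ")"
    case Mul(l, r) => "(" + emitExpr(l) + " * " + emitExpr(r) + ")"
  }

  // Wrap the expression in a complete C program that prints its value.
  def emitProgram(e: Expr): String = Seq(
    "#include <stdio.h>",
    "",
    "int main(void) {",
    "    printf(\"%d\\n\", " + emitExpr(e) + ");",
    "    return 0;",
    "}"
  ).mkString("\n")
}
```

Calling `CBackend.emitProgram(Add(Num(1), Mul(Num(2), Num(3))))` yields C source that can be handed to any C compiler. A real language would also need statements, variables, and a runtime, which this sketch leaves out.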

I think compiling to some intermediate representation is preferable to compiling to a general-purpose programming language. Between LLVM IR and Java Bytecode I'd go with Java Bytecode - even though I personally like LLVM IR better - because you wrote that you basically want both, and while you can sort of convert Java Bytecode to LLVM IR, the other direction is very difficult.

The only remaining difficulty is translating your language to Java Bytecode. This related question about tools that can make it easier might help.
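As a rough idea of what that translation involves, the sketch below turns a toy expression AST into a list of JVM-style stack instructions by a post-order walk (operands first, then the operator). The AST classes and the `StackCodeSketch` object are hypothetical, and the output is only textual mnemonics (`ldc`, `iadd`, `imul`), not a valid class file. Producing real bytecode would still require a library such as CafeBabe or a similar tool. The small `run` function is not a JVM; it is only there to check that the emitted sequence computes the expected value.

```scala
// Hypothetical toy AST; not the asker's actual case classes.
sealed trait Expr
case class Num(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr

object StackCodeSketch {
  // Post-order walk: emit code for both operands, then the operator.
  def emit(e: Expr): List[String] = e match {
    case Num(v)    => List("ldc " + v)
    case Add(l, r) => emit(l) ++ emit(r) ++ List("iadd")
    case Mul(l, r) => emit(l) ++ emit(r) ++ List("imul")
  }

  // Tiny stack machine for the three mnemonics above, used only as a sanity check.
  def run(code: List[String]): Int =
    code.foldLeft(List.empty[Int]) {
      case (b :: a :: rest, "iadd") => (a + b) :: rest
      case (b :: a :: rest, "imul") => (a * b) :: rest
      case (stack, instr) if instr.startsWith("ldc ") =>
        instr.drop(4).toInt :: stack
      case (_, instr) =>
        sys.error("unsupported or malformed instruction: " + instr)
    }.head
}
```

For `Add(Num(1), Mul(Num(2), Num(3)))`, `emit` produces `ldc 1`, `ldc 2`, `ldc 3`, `imul`, `iadd`. The JVM is a stack machine, so expression trees map onto it fairly directly; control flow, method calls, and verification are where most of the real work lies.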

Finally, another advantage of Java Bytecode is that it'll play well with your interpreter, effectively allowing you to easily generate a hotspot-like JITter (or even a trace compiler).

Oak
  • It seems easy with the mentioned CafeBabe, but I am afraid of the "sort of compile to llvm IR". –  Mar 20 '13 at 17:11
  • @davips that's a valid concern. Translating Bytecode to LLVM IR by itself isn't hard, but you also have to worry about various runtime services that a JVM usually provides, particularly the garbage collector. It's just that the other way around - LLVM IR to Bytecode - is pretty difficult, as LLVM IR can contain code which will be hard to modify to pass JVM verification. I guess you can maintain two different back-ends :) – Oak Mar 20 '13 at 19:58
  • Thanks for all the clarification; maybe I am going further than I can. I played a little with CafeBabe, and even though it is easier than writing bytecode by hand, I think I should try the approach that emits Java and C code. It avoids a lot of problems. As soon as the C back-end is finished, it will be easy to emit equivalent Java code. –  Mar 21 '13 at 04:39
  • One major drawback is that the user will need to have the entire JDK, not only the JRE to run my compiler. –  Mar 21 '13 at 04:42
  • @davips as I've written above, the major drawback to compiling to C or Java is that you basically compile twice. If this is a non-issue for you, then this *is* a valid solution. Also take a look at [Cibyl](https://code.google.com/p/cibyl/), which can compile C to Java Bytecode - maybe it's good enough that you won't have to maintain two backends. – Oak Mar 21 '13 at 07:23
  • I will try it. NestedVM seems similar. –  Mar 21 '13 at 14:37
  • A mixed approach may be a good solution to avoid the need for a JDK at the client, avoid double compilation and speed up things: I can provide the complex part precompiled into class files and emit only the easy bytecodes. –  Mar 22 '13 at 00:57

I agree with @Oak that bytecode is the simplest target. One Scala library for generating bytecode is CafeBabe by @psuter.

You cannot do everything with it, but for a small project it could be sufficient. The syntax is also very clear. Please see the project Wiki for more information.

Lomig Mégard