4

I started using ANTLR today and I've created a basic parser.

After parsing I end up with a tree. To me it seems like this is just a bunch of Strings put together in a tree structure of Tree-nodes. That's not very useful to me. I'd like to have a graph of objects.

To clarify (this is an example, and not my real application): For "5-1+6" I seem to end up with:

new String("PLUS")
    new String("MINUS")
        new String("5")
        new String("1")
    new String("6")

What I would find more useful:

new Plus(
    new Minus(
        new IntegerLiteral(5),
        new IntegerLiteral(1)),
    new IntegerLiteral(6))

What is the most convenient way of going from the first representation to the other? In this article the author does something similar to this:

public Expression createExpr(CommonTree ast) {

    // ...

    switch (ast.getType()) {
    case SimpleExpressionParser.INT:
        return new IntegerLiteral(ast.getText())
    case SimpleExpressionParser.PLUS:
        return new Plus(createExpr((CommonTree)ast.getChild(0)),    // recurse
                        createExpr((CommonTree)ast.getChild(1)));   // recurse
    case SimpleExpressionParser.MINUS:
        return new Minus(createExpr((CommonTree)ast.getChild(0)),   // recurse
                         createExpr((CommonTree)ast.getChild(1)));  // recurse
    }

    // ...
}

Is this the preferred way?! Can't I instruct ANTLR to generate this boiler-plate code somehow (it will be huge)?


Possibly related questions:

Community
  • 1
  • 1
aioobe
  • 413,195
  • 112
  • 811
  • 826
  • You will always have to hand-roll a part of such boiler plate code, but it can be done in a (IMHO) better, or more readable way though. I haven't the time right now to answer you, but if someone hasn't done so by tomorrow, I'll probably post a suggestion. – Bart Kiers Mar 09 '11 at 22:27
  • thanks! I've read many of your answers while looking for an answer to this. I was hoping you'd see my question :-) I've discovered the `returns [MyOwnType var]` syntax, (which I suspect is on the right track) but still can't figure out how to pass the matched parts into the constructor. – aioobe Mar 09 '11 at 22:36
  • @aioobe, I realize that I might have made some short-cuts in my explanation (depending how much ANTLR you know). Feel free to ask for clarification(s) in the comments of my answer. – Bart Kiers Mar 10 '11 at 08:46
  • I was surprised by OP's first tree result; ANTLR really does this? What the OP wanted was trees with nodes marked by their types. OP's first example tree containing references to literal strings is a reasonble simulation of enums, that part seems OK. The leaf strings being the literal values screws up the enum idea. Kier's answer has a lot of machinery to make a more conventional tree... but why is all that machinery necessary? For an alternative view on how to build trees painlessly (but not with ANTLR), see http://www.semanticdesigns.com/products/DMS/SimpleDMSDomainExample.html – Ira Baxter Mar 10 '11 at 18:07
  • Each string is accompanied by an `int` representing the type of the node (if I understand it correctly). This however is still not very useful as I see it. – aioobe Mar 10 '11 at 18:34
  • @aioobe: so, you've got a tree and each node has an integer that identifies it as to its type. What's the issue then? (Now I don' see the point of the strings on the interior nodes). – Ira Baxter Mar 11 '11 at 03:32
  • Suppose you have "123.45" as your content-string, and you have some custom treatement for doubles. Now every single time you need to for instance check if it's greater than 0 or non-integer or what ever, you need to *parse* the string, using say, `Double.valueOf`. If you had transformed this to a `DoubleNode` instead of a generic `(String, int)`-node, you could just do `doubleNode.doubleValue()` or, even better, `doubleNode.isGreaterThanZero()` or `doubleNode.isInteger()`. (Note that I'm still new to this, and I may have overlooked something! If so, please enlighten me!) – aioobe Mar 11 '11 at 08:42
  • @aioobe: Ah, yes, you also want value-carrying lexemes to have their content converted to native values (float -> real, hex number -> integer, character string with embedded escapes to the actual string, ...). I agree this is really convenient. I don't know if ANTLR does this. My previous comment referenced DMS; DMS trees have values stored the way you want, and it is pretty easy to convert them from/to strings; see the DMS example I referenced. – Ira Baxter Mar 13 '11 at 22:00
  • @Ira, no, by default, ANTLR either produces CommonToken or CommonTree instances. These objects don't have convenience methods to convert their contents to native values. But you can easily tell ANTLR to use a custom token class that either extends CommonToken or CommonTree in which you yourself put this functionality. I find it easier though to create custom tokens/tree-nodes inside the tree grammar as demonstrated in my answer. – Bart Kiers Mar 14 '11 at 12:57
  • @Bart: That seems pretty reasonable as a solution. So now we're back to precise the nature of @aiiobe's complaint: Is it just that ANTLR's default representation requires him to specify a bit more? (e.g., code the convenience conversions)? – Ira Baxter Mar 14 '11 at 19:10

1 Answers1

23

Here's a a possible way. In short, these are the step you'd perform:

  1. create a combined grammar that will generate the lexer and parser;
  2. mix AST rewrite rules in the grammar from (1) to transform the flat list of tokens into a proper tree;
  3. write a tree grammar that can walk the tree from (2);
  4. mix custom code inside your tree walker;
  5. test it.

1

Let's create a small expression parser supporting +, -, *, /, (...) and numbers, which could look like:

grammar Exp; // file: Exp.g

eval
  :  exp EOF
  ;

exp 
  :  addExp 
  ;

addExp
  :  mulExp ((Add | Sub) mulExp)*
  ;

mulExp
  :  unaryExp ((Mul | Div) unaryExp)*
  ;

unaryExp
  :  Sub atom
  |  atom
  ;

atom
  :  Number
  |  '(' exp ')'
  ;

Add    : '+';
Sub    : '-';
Mul    : '*';
Div    : '/';
Number : '0'..'9'+;
Space  : ' ' {skip();};

2

Including rewrite rules, it will look like:

grammar Exp; // file: Exp.g

options {
  output=AST;
}

tokens {
  U_SUB;
}

eval 
  :  exp EOF -> exp
  ;

exp 
  :  addExp 
  ;

addExp
  :  mulExp ((Add | Sub)^ mulExp)*
  ;

mulExp
  :  unaryExp ((Mul | Div)^ unaryExp)*
  ;

unaryExp
  :  Sub atom -> ^(U_SUB atom)
  |  atom
  ;

atom
  :  Number
  |  '(' exp ')' -> exp
  ;

Add    : '+';
Sub    : '-';
Mul    : '*';
Div    : '/';
Number : '0'..'9'+;
Space  : ' ' {skip();};

Now an expression like 10 - 2 * (3 + 8) will be transformed to:

enter image description here

3

To create a tree grammar that generates an iterator for the AST generated in (2), you'd do something like this:

tree grammar ExpWalker; // file: ExpWalker.g

options {
  tokenVocab=Exp; // use the tokens from Exp.g
  ASTLabelType=CommonTree;
}

eval
  :  exp
  ;

exp
  :  ^(Add exp exp)
  |  ^(Sub exp exp)
  |  ^(Mul exp exp)
  |  ^(Div exp exp)
  |  ^(U_SUB exp)
  |  Number
  ;

4

And to mix your custom classes in this tree iterator, do something like this:

tree grammar ExpWalker; // file: ExpWalker.g

options {
  tokenVocab=Exp; // use the tokens from Exp.g
  ASTLabelType=CommonTree;
}

eval returns [ExpNode e]
  :  exp {e = $exp.e;}
  ;

exp returns [ExpNode e]
  :  ^(Add a=exp b=exp) {e = new AddExp($a.e, $b.e);}
  |  ^(Sub a=exp b=exp) {e = new SubExp($a.e, $b.e);}
  |  ^(Mul a=exp b=exp) {e = new MulExp($a.e, $b.e);}
  |  ^(Div a=exp b=exp) {e = new DivExp($a.e, $b.e);}
  |  ^(U_SUB a=exp)     {e = new UnaryExp($a.e);}
  |  Number             {e = new NumberExp($Number.text);}
  ;

5

Here's some code to test all classes (stick it all just in one file: Main.java):

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = "10 - 2 * (3 + 8)";
    ExpLexer lexer = new ExpLexer(new ANTLRStringStream(source));
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    ExpParser parser = new ExpParser(tokens);
    ExpParser.eval_return returnValue = parser.eval();
    CommonTree tree = (CommonTree)returnValue.getTree();
    CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
    ExpWalker walker = new ExpWalker(nodes);
    ExpNode root = walker.eval();
    System.out.println(source + " = " + root.evaluate());
  }
}

interface ExpNode {
  double evaluate();
}

class NumberExp implements ExpNode {

  final double num;

  NumberExp(String s) {
    num = Double.parseDouble(s);
  }

  @Override
  public double evaluate() {
    return num;
  }
}

class AddExp implements ExpNode {

  final ExpNode left, right;

  AddExp(ExpNode a, ExpNode b) {
    left = a;
    right = b;
  }

  @Override
  public double evaluate() {
    return left.evaluate() + right.evaluate();
  }
}

class SubExp implements ExpNode {

  final ExpNode left, right;

  SubExp(ExpNode a, ExpNode b) {
    left = a;
    right = b;
  }

  @Override
  public double evaluate() {
    return left.evaluate() - right.evaluate();
  }
}

class MulExp implements ExpNode {

  final ExpNode left, right;

  MulExp(ExpNode a, ExpNode b) {
    left = a;
    right = b;
  }

  @Override
  public double evaluate() {
    return left.evaluate() * right.evaluate();
  }
}

class DivExp implements ExpNode {

  final ExpNode left, right;

  DivExp(ExpNode a, ExpNode b) {
    left = a;
    right = b;
  }

  @Override
  public double evaluate() {
    return left.evaluate() / right.evaluate();
  }
}

class UnaryExp implements ExpNode {

  final ExpNode exp;

  UnaryExp(ExpNode e) {
    exp = e;
  }

  @Override
  public double evaluate() {
    return -exp.evaluate();
  }
}

and then do:

# generate a lexer & parser
java -cp antlr-3.2.jar org.antlr.Tool Exp.g

# generate the tree walker
java -cp antlr-3.2.jar org.antlr.Tool ExpWalker.g

# compile everything
javac -cp antlr-3.2.jar *.java

# run the main class
java -cp .:antlr-3.2.jar Main         # *nix 
java -cp .;antlr-3.2.jar Main         # Windows

which prints:

10 - 2 * (3 + 8) = -12.0

You could skip the tree-walker and mix all the code and returns [...] inside your combined grammar, but IMO, a tree grammar keeps things more orderly because the lexer rules, and tokens like ( and ) etc. are removed from it.

HTH

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • Wow. What an excellent answer. Thanks a lot! I've probably read it about five times now and I think I've grasped most of the details. One key element in your solution is the `tokenVocab=Exp;`. In my real application, I'll have statements and expressions. How do I solve this? Separate grammar-files that link together somehow? – aioobe Mar 10 '11 at 09:59
  • @aioobe, yeah, that was a bit confusing... The `tokenVocab=Exp;` tells the tree walker to use the tokens produced by the grammar `Exp.g`. The naming of the interface `Exp` was misleading: I renamed it in my example and did a quick test to see if it all still works (which it does). And you're welcome, of course! – Bart Kiers Mar 10 '11 at 10:30
  • Thanks Bart. I'm trying out your example right now. Terrence Parr really seem to [discourage the use of this type of object graph](http://www.jguru.com/faq/view.jsp?EID=818959) and recommend just using the untyped parse trees instead. Do you share his opinion on this? I think it should work quite well, especially since I plan to use it together with Scala and its pattern-matching facilities. I mean, I'm going to write a full compiler (for a small language). Surely I'll need something more sophisticated than an untyped tree at some point. – aioobe Mar 10 '11 at 13:22
  • 2
    @aioobe, I must admit, I don't fully understand Terence's point(s). He's against mixing code inside the parser (that's my `Exp.g` grammar), so I'm okay there. He recommends doing the actual evaluating inside the tree grammar (still okay there! :)), but I'm not sure if he's against creating specific nodes _in_ the tree grammar like I just did. In that case, I don't agree: personally I like to keep my grammars as empty as possible, and separate functionality in specific classes. I've written a small language as well with for- and while statements, scopes and the likes, similar as my example ... – Bart Kiers Mar 10 '11 at 13:52
  • ... above, and that works like a charm: my language can be easily enhanced with new construct/functionality like this. And bugs are traced pretty quickly as well. – Bart Kiers Mar 10 '11 at 13:53
  • Thanks a lot! This gives me confidence with going along with this approach! Thanks. (Now I'm trying to mix a statement-grammar with an expression-grammar. I'll post another question if I run into trouble here!) Thanks again! – aioobe Mar 10 '11 at 14:10
  • 1
    @aioobe, no problem. Note that, for example, an `assignment` rule (or some other statement) could also be considered of type `Exp` but just one that returns a sort of predefined `void` value instead. Best of luck with your language! – Bart Kiers Mar 10 '11 at 15:07
  • I wasn't even looking for this question, but took the time to read your answer anyway. That is an excellent step by step tutorial for tree building actions, thanks for writing this. – Andrew Jan 23 '13 at 00:43
  • Thanks @BartKiers, 7 years later!...still this answer have cleared my doubts! – parth Aug 10 '18 at 07:11
  • Ditto; in mid-2019, this was still an excellent, highly-informative answer that saved me a lot of groping in the dark. – Syzygy May 18 '19 at 05:27