5

I'm trying to parse a language using ANTLR which can contain the following syntax:

someVariable, somVariable.someMember, functionCall(param).someMember,  foo.bar.baz(bjork).buffalo().xyzzy

This is the ANTLR grammar which i've come up with so far, and the access_operation throws the error

The following sets of rules are mutually left-recursive [access_operation, expression]:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

tokens {
  LHS;
  RHS;
  CALL;
  PARAMS;
}

start   
  :  body? EOF
  ;

body
  : expression (',' expression)*
  ;

expression
  : function -> ^(CALL)
  | access_operation
  | atom
  ;

access_operation
  : (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
  ;

function
  : (IDENT '(' body? ')') -> ^(IDENT PARAMS?) 
  ;         

atom
  : IDENT
  | NUMBER
  ;

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : '0'..'9';

IDENT    : (LETTER)+ ;
NUMBER   : (DIGIT)+ ;
SPACE    : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };

What i could manage so far was to refactor the access_operation rule to '.' expression which generates an AST where the access_operation node only contains the right side of the operation.

What i'm looking for instead is something like this:

enter image description here

How can the left-recursion problem solved in this case?

pulse00
  • 1,294
  • 1
  • 16
  • 25
  • 1
    What would you want your grammar to do when given the input `foo.bar.baz(bjork).buffalo().xyzzy` ? – sarnold Jan 02 '12 at 02:00
  • 1
    *"which generates the wrong AST though"*, how did this wrong AST look like? How *should* the AST look like instead? – Bart Kiers Jan 02 '12 at 08:11
  • 2
    As a side note, ANTLR's next major release (v4) will also be able to handle left recursion! The exact release date is unknown to me, but I've played around with an early release, and it's very cool! – Bart Kiers Jan 02 '12 at 08:54

1 Answers1

7

By "wrong AST" I'll make a semi educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child who in its turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you: I'd keep the AST as flat as possible:

    foo
   / | \
  /  |  \
bar baz  ...

That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.

A quick demo:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

tokens {
  BODY;
  ACCESS;
  CALL;
  PARAMS;
}

start   
 : body EOF -> body
 ;

body
 : expression (',' expression)* -> ^(BODY expression+)
 ;

expression
 : atom
 ;         

atom
 : NUMBER
 | (IDENT -> IDENT) ( tail       -> ^(IDENT tail)
                    | call tail? -> ^(CALL IDENT call tail?)
                    )?
 ;

tail
 : (access)+
 ;

access
 : ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
 ;

call
 : '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
 ;

IDENT  : LETTER+;
NUMBER : DIGIT+;
SPACE  : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : '0'..'9';

which can be tested with:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " + 
        "foo.bar.baz(bjork).buffalo().xyzzy";
    TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
    TestParser parser = new TestParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.start().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

The output of Main corresponds to the following AST:

enter image description here

EDIT

And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

tokens {
  BODY;
  ACCESS_OP;
  CALL;
  PARAMS;
  LHS;
  RHS;
}

start   
 : body EOF -> body
 ;

body
 : expression (',' expression)* -> ^(BODY expression+)
 ;

expression
 : atom
 ;         

atom
 : NUMBER
 | (ID -> ID) ( ('(' params ')' -> ^(CALL ID params)) 
                ('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
              | '.' expression  -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
              )?
 ;

params
 : (expression (',' expression)*)? -> ^(PARAMS expression*)
 ;

ID     : LETTER+;
NUMBER : DIGIT+;
SPACE  : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};

fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT  : '0'..'9';

which creates the following AST if you run the Main class:

enter image description here

The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:

enter image description here

which means atom can be any of the 5 following alternatives (with their corresponding AST's):

+----------------------+--------------------------------------------------------+
| alternative          | generated AST                                          |
+----------------------+--------------------------------------------------------+
| NUMBER               | NUMBER                                                 |
| ID                   | ID                                                     |
| ID params            | ^(CALL ID params)                                      |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression))|
| ID expression        | ^(ACCESS_OP ^(LHS ID) ^(RHS expression)                |
+----------------------+--------------------------------------------------------+
Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • 1
    thanks for the in-depth answer. the modified grammar resolves the left-recursion problem. I was not very clear about what i meant with "wrong AST" though. The API i'm using for AST traversal provides a class `FieldAccess` with the signature `ASTNode dispatcher, ASTNode field`. The AST i am trying to create should be compatible with the API in this [example](https://gist.github.com/1550258). This means that the `access` rule should create an ASTNode which contains both the left and the right side of the access. And this re-introduces the left recursion problem. – pulse00 Jan 02 '12 at 10:59
  • but maybe i'm on the wrong path here and i can get access to the `dispatcher` of the `access` rule using other means in ANTLR using your modified grammar. – pulse00 Jan 02 '12 at 11:03
  • i've updated the question and added the AST i'm trying to create. – pulse00 Jan 02 '12 at 15:40
  • 1
    the idea to move the `ID` on the left part of the `atom` rule and all the rest as an optional second token to the right was great. It now produces the exact AST the API is expecting. By the way, the API i'm talking about is the [Dynamic Language Toolkit](http://eclipse.org/dltk/) which provides means to implement scripting languages in eclipse. – pulse00 Jan 02 '12 at 22:17