Left-factoring grammar of coffeescript expressions

Question

I'm writing an Antlr/Xtext parser for coffeescript grammar. It's at the beginning yet, I just moved a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL() and Left Factoring gave me more insights, but I still can't get a handle.

My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the trouble maker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers are also welcome to tips how to "debug" this kind of erroneous grammar in your head.

grammar CoffeeScript;

options {
  output=AST;
}

tokens {
  AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}

COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN  :   '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER  :   ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof'; 
RPAREN  :   ')';
SHIFT : '<<' | '>>';
STRING  :   '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER :    ('a'..'z' | 'A'..'Z')+;

WS  :   (' ')+ {skip();} ;

root
  : body
  ;

body
  : line
  ;

line
  : expression
  ;

assign
  : assignable EQUAL expression
  ;

expression
  : value
  | assign
  | operation
  ;

identifier
  : IDENTIFIER
  ;

simpleAssignable
  : identifier
  ;

assignable
  : simpleAssignable
  ;

value
  : assignable
  | literal
  | parenthetical
  ;

literal
  : alphaNumeric
  ;

alphaNumeric
  : NUMBER 
  | STRING;

parenthetical
  : LPAREN body RPAREN
  ;

// term should be the same as expression except operation to avoid left-recursion
term
  : value
  | assign
  ;

questionOp
  : term QUESTION?
  ;

mathOp
  : questionOp (MATH questionOp)*
  ;

additiveOp
  : mathOp ((PLUS | MINUS) mathOp)*
  ;

shiftOp
  : additiveOp (SHIFT additiveOp)*
  ;

relationOp
  : shiftOp (RELATION shiftOp)*
  ;

compareOp
  : relationOp (COMPARE relationOp)*
  ;

logicOp
  : compareOp (LOGIC compareOp)*
  ;

operation
  : UNARY expression
  | MINUS expression
  | PLUS expression
  | MINUS_MINUS simpleAssignable
  | PLUS_PLUS simpleAssignable
  | simpleAssignable PLUS_PLUS
  | simpleAssignable MINUS_MINUS
  | simpleAssignable COMPOUND_ASSIGN expression
  | logicOp
  ;

UPDATE: The final solution will use Xtext with an external lexer to avoid to intricacies of handling significant whitespace. Here is a snippet from my Xtext version:

CompareOp returns Operation:
  AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;

My strategy is to make a working Antlr parser first without a usable AST. (Well, it would deserve a separates question if this is a feasible approach.) So I don't care about tokens at the moment, they are included to make development easier.

I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.

UPDATE2 and SOLUTION: I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.

grammar FunExp;

ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};

root
  : expression
  ;

// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
  : operation
  ;

atom
  : NUMBER
  | ID
  ;

functionCall
  : ID '(' expression (',' expression)* ')'
  ;

operation
  : multiOp
  ;

multiOp
  : additiveOp (('*' | '/') additiveOp)*
  ;

additiveOp
  : term (('+' | '-') term)*
  ;

term
  : atom
  | functionCall
  | '(' expression ')'
  ;

You should add your grammar to https://github.com/antlr/grammars-v4 — It would be a very valuable source! :) — Benny Code, Sep 02 '16 at 15:44

Bart Kiers · Accepted Answer · 2011-11-25T10:40:08.247

When you generate a lexer and parser from your grammar, you see the following error printed to your console:

error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.

warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3

As a result, alternative(s) 3 were disabled for that input

(I've emphasized the important bits)

This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.

The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:

expression
  : value      // choice 1
  | assign     // choice 2
  | operation  // choice 3
  ;

Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:

choice 1:

expression
  value
    literal
      alphaNumeric : {NUMBER, STRING}

choice 3:

expression
  operation
    logicOp
      relationOp
        shiftOp
          additiveOp
            mathOp
              questionOp
                term
                  value
                    literal
                      alphaNumeric : {NUMBER, STRING}

In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).

So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.

I noticed 3 other things in your grammar:

1

Take the following lexer rules:

NEW : 'new';
...
UNARY : '!' | '~' | NEW;

Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY macth this, remove the NEW rule and do:

UNARY : '!' | '~' | 'new';

2

In may occasions, you're collecting multiple types of tokens in a single one, like LOGIC:

LOGIC : '&&' | '||';

and then you use that token in a parser rules like this:

logicOp
  : compareOp (LOGIC compareOp)*
  ;

But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):

AND : '&&';
OR  : '||';

...

logicOp
  : compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
              | OR  compareOp // easier to evaluate, you know it's an OR expression
              )*
  ;

3

You're skipping white spaces (and no tabs?) with:

WS  :   (' ')+ {skip();} ;

but doesn't CoffeeScript indent it's code block with spaces (and tabs) just like Python? But perhaps you're going to do that in a later stage?

I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefor jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.

Your explanation about the error messages and the parser paths helped me a lot understanding the problem and made the cryptic-looking Antlr messages more meaningful. wrt to your comments on the grammar, I'm not interested in the lexer part yet (see my update). I still find your first comment useful, I wasn't aware of that. — Adam Schmideg, Nov 25 '11 at 14:42
@Adam, I guessed your grammar had the ambiguity, but I was wrong: CoffeeScript is full of ambiguities! It appears a single identifier is a valid statement (as an implicit return value) causing the parser to possibly parse a single identifier as a return-statement, the start of an assignment, or a function-call (to name just 3 possibilities). It's a horrible language to parse, but after playing around with it a bit, it does make JavaScript so much more fun! A cute little language to work with! Really nice. — Bart Kiers, Nov 25 '11 at 19:17
@AdamSchmideg, a final word of caution: I also see that input like `f 1 + f 2` is valid and should be interpreted as `f(1 + f(2))`. An LR parser which matches input from right to left will have little trouble with this, but trying to do so with an LL parser, will get clunky. But there are many more things in the original CoffeeScript grammar that won't work: left recursion in the LR grammar won't work at all in the LL grammar, for example. I'm sure it's not impossible, but it's going to get tough to write a decent CoffeeScript grammar in ANTLR syntax. Best of luck, of course! — Bart Kiers, Nov 25 '11 at 21:58
thank you for your comments and advices. I guess I'm going to post more questions along the way :) — Adam Schmideg, Nov 26 '11 at 16:42

Left-factoring grammar of coffeescript expressions

1 Answers1

choice 1:

choice 3:

1

2

3