Why is antlr4 parsing my sentence into two statements?

Question

I am writing a little parser for expressions. At the moment I just want it to recognize binary multiplications (myId * myId) and C-like dereferenced pointers (*myId), plus some assignation statements (myId *= myId).

The input that makes the parser throw errors is:

x *= y;

... on which the parser fails with this message and parse tree:

[line 1:1 mismatched input ' *' expecting {';', NEWLINE}]
(sourceFile (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression x))))  * =  ) (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression y)))) ;) <EOF>)

I've been scratching my head for a while but I can't see what is wrong in my grammar (see it below). Any hints, please? Thanks in advance.

grammar Sable;

options {}

@header {
    package org.sable.parser;
}

ASSIGNMENT_OP:
    '='
    ;

BINARY_OP:
    '*'
    ;

WS_BUT_NOT_NEWLINE:
    WhiteSpaceButNotNewLineCharacter
    ;

NEWLINE:
    ('\u000D' '\u000A')
    | '\u000A'
    ;

WSA_BINARY_OP:
    (WS_BUT_NOT_NEWLINE+ BINARY_OP WS_BUT_NOT_NEWLINE+)
    | BINARY_OP
    ;

WSA_PREFIX_OP:
    (WS_BUT_NOT_NEWLINE+ '*' )
    ;

WS  :  WhiteSpaceCharacter+ -> skip
    ;

IDENTIFIER:
    (IdentifierHead IdentifierCharacter*)
    | ('`'(IdentifierHead IdentifierCharacter*)'`')
    ;

// NOTE: a file with zero statements is allowed because
// it can contain just comments.
sourceFile:
    statement* EOF;

statement:
    expressionStatement (';' | NEWLINE);

// Req. not existing any valid expression starting from
// an equals sign or any other assignment operator.
expressionStatement:
    expression (assignmentOperator expression)?;

expression:
    monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

monoOperatedExpression:
    atomicExpression
    ;

binaryOperator:
    WSA_BINARY_OP
    ;

atomicExpression:
    IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
    ;

type:
    IDENTIFIER
    ;

assignmentOperator:
    ASSIGNMENT_OP
    ;

fragment DecimalDigit:
    '0'..'9'
    ;

fragment IdentifierHead:
    'a'..'z'
    | 'A'..'Z'
    ;
fragment IdentifierCharacter:
    DecimalDigit
    | IdentifierHead
    ;

fragment WhiteSpaceCharacter:
    WhiteSpaceButNotNewLineCharacter
    | NewLineCharacter;

fragment WhiteSpaceButNotNewLineCharacter:
    [\u0020\u000C\u0009u000B\u000C]
    ;

fragment NewLineCharacter:
    [\u000A\u000D]
    ;

EDIT: adding a new version of the grammar on request of commenters.

grammar Sable;

options {}

@header {
    package org.sable.parser;
}

//
// PARSER RULES.

sourceFile              : statement* EOF;
statement               : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement     : expression (ASSIGNMENT_OPERATOR expression)?;

expression:
    expression WSA_OPERATOR expression
    | expression OPERATOR expression
    | OPERATOR expression
    | expression OPERATOR
    | atomicExpression
    ;

atomicExpression:
    IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
    ;

type                    : IDENTIFIER;


//
// LEXER RULES.

COMMENT                 : '/*' .*? '*/'                    -> channel(HIDDEN);
LINE_COMMENT            : '//' ~[\000A\000D]*              -> channel(HIDDEN);

ASSIGNMENT_OPERATOR     : Operator? '=';

// WSA = White Space Aware token.
// These are tokens that occurr in a given whitespace context.
WSA_OPERATOR:
    (WhiteSpaceNotNewline+ Operator WhiteSpaceNotNewline+)
    ;

OPERATOR         : Operator;

// Newline chars are defined apart because they carry meaning as a statement
// delimiter.
NEWLINE:
    ('\u000D' '\u000A')
    | '\u000A'
    ;

WS                      : WhiteSpaceNotNewline -> skip;

SEMICOLON               : ';';


IDENTIFIER:
    (IdentifierHead IdentifierCharacter*)
    | ('`'(IdentifierHead IdentifierCharacter*)'`')
    ;

fragment DecimalDigit   :'0'..'9';

fragment IdentifierHead:
    'a'..'z'
    | 'A'..'Z'
    | '_'
    | '\u00A8'
    | '\u00AA'
    | '\u00AD'
    | '\u00AF' |
    '\u00B2'..'\u00B5' |
    '\u00B7'..'\u00BA'  |
    '\u00BC'..'\u00BE' |
    '\u00C0'..'\u00D6' |
    '\u00D8'..'\u00F6' |
    '\u00F8'..'\u00FF' |
    '\u0100'..'\u02FF' |
    '\u0370'..'\u167F' |
    '\u1681'..'\u180D' |
    '\u180F'..'\u1DBF' |
    '\u1E00'..'\u1FFF' |
    '\u200B'..'\u200D' |
    '\u202A'..'\u202E' |
    '\u203F'..'\u2040' |
    '\u2054' |
    '\u2060'..'\u206F' |
    '\u2070'..'\u20CF' |
    '\u2100'..'\u218F' |
    '\u2460'..'\u24FF' |
    '\u2776'..'\u2793' |
    '\u2C00'..'\u2DFF' |
    '\u2E80'..'\u2FFF' |
    '\u3004'..'\u3007' |
    '\u3021'..'\u302F' |
    '\u3031'..'\u303F' |
    '\u3040'..'\uD7FF' |
    '\uF900'..'\uFD3D' |
    '\uFD40'..'\uFDCF' |
    '\uFDF0'..'\uFE1F' |
    '\uFE30'..'\uFE44' |
    '\uFE47'..'\uFFFD'
    ;
fragment IdentifierCharacter:
    DecimalDigit
    | '\u0300'..'\u036F'
    | '\u1DC0'..'\u1DFF'
    | '\u20D0'..'\u20FF'
    | '\uFE20'..'\uFE2F'
    | IdentifierHead
    ;
// Non-newline whitespaces are defined apart because they carry meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline    : [\u0020\u000C\u0009u000B\u000C];

fragment Operator:
    '*'
    | '/'
    | '%'
    | '+'
    | '-'
    | '<<'
    | '>>'
    | '&'
    | '^'
    | '|'
    ;

you should have created a minimal version of your grammar that still exhibits this behavior before posting here. — bolov, Sep 21 '17 at 22:07
Hm, OK, I will try to cut it down. It's pretty minimal though. — AmazingWouldBeGreatBut, Sep 21 '17 at 22:11
you can still easily cut try_operator, comment, line_comment, fragment_decimal. And I saw those just at a quick glance. — bolov, Sep 21 '17 at 22:38

GRosenberg · Accepted Answer · 2017-09-22T19:37:56.127

The rule

expression
    : monoOperatedExpression (binaryOperator monoOperatedExpression)?
    ;

does not permit an = after the binaryOperator. Accordingly, the runtime reports that it did not know what next rule to use following the consumption of the BINARY_OP.

The grammar can be fixed with some significant restructuring and, preferably, simplification.

1 - Whitespace/newline handling can be greatly simplified by ignoring it.

WS : [ \t\r\n] -> skip;

C-family and Python-like languages are context free languages with a few, well-known context sensitive corner cases. ANTLR is a context free parser with a number of convenience capabilities to handle context sensitivities. So, ignoring (or hiding) whitespace should be the default.

2 - disambiguate the use of * by definition:

STAR_EQUAL : '*=' ;
STAR       : '*'  ;
EQUAL      : '='  ;

This ensures that any single STAR is available to be considered only as a pointer mark or multiplication operator (the sequence STAR WS EQUAL is either invalid in your language or could have some custom meaning).

3 - use parser rule recursion: consider the expression handling rules in the C grammar, specifically starting with the expression rule. The simplified pattern is:

expression     // list of all valid syntaxes for an `expression`
    : LPAREN expression RPAREN
    | expression ( COMMA expression )*
    | expression op expression 
    | unitary_op expression 
    | expression unitary_op 
    | << any other valid syntax >>
    | atom
    ;

 unitary_op : 2PLUS | 2DASH | .... ;
 op         : STAR_EQUAL | STAR | EQUAL | .... ;

 atom
    : STAR? IDENTIFIER   // pointer usage
    | NUMBER
    ;

Presented this way, the grammar will be far more readable and maintainable.

With these changes, completing the revision of the grammar is left as an easy exercise for the OP (meaning, try it and post any problems encountered).

Bonus - ANTLR is a top down parser. So, put the parser rules at the top, organized broad to narrow. Followed by the lexer rules, also ordered in the same way, any lexer modes, and then with fragment rules at the very bottom.

This ordering ease the cognitive load of understanding the grammar by you and others. For example, makes the tree dump easier/quicker to understand. Will also ease the task of eventually dividing into a split grammar (recommended if the grammar is of any significant complexity and required if there are modes).

Full Grammar

grammar Sable;

@header {
    package org.sable.parser.gen;
}

sable
    : statement* EOF
    ;

statement
    : expression? SEMI
    ;

expression
    : LPAREN expression RPAREN
    | COMMA expression
    | expression op expression
    | unitary_op expression
    | expression unitary_op
    | STAR? IDENTIFIER
    | NUMBER
    ;

 unitary_op
    : DPLUS | DMINUS
    ;

 op : STAR_EQUAL | DIV_EQUAL | PLUS_EQUAL | MINUS_EQUAL | EQUAL
    | STAR | DIV | PLUS | MINUS
    ;


COMMENT     : Comment -> skip ;

STAR_EQUAL  : '*=' ;
DIV_EQUAL   : '/=' ;
PLUS_EQUAL  : '+=' ;
MINUS_EQUAL : '-=' ;
EQUAL       : '='  ;

STAR        : '*'  ; // mult or pointer
DIV         : '/'  ;
PLUS        : '+'  ;
MINUS       : '-'  ;

DPLUS       : '++' ;
DMINUS      : '--' ;

COMMA       : ','  ;
DOT         : '.'  ;
SEMI        : ';'  ;

LPAREN      : '('  ;
RPAREN      : ')'  ;
LBRACE      : '{'  ;
RBRACE      : '}'  ;
LBRACK      : '['  ;
RBRACK      : ']'  ;
LANGLE      : '<'  ;
RANGLE      : '>'  ;

NUMBER      : [0-9]+ ('.' [0-9]+)? ([eE] [+-]? [0-9]+)? ;
IDENTIFIER  : [a-zA-Z_][a-zA-Z0-9_-]*  ;

WS          : [ \t\r\n]+    -> skip;

ERRCHAR
    :   .   -> channel(HIDDEN)
    ;

fragment Comment
    :   '/*' .*? '*/'
    |   '//' ~[\r\n]*
    ;

Generates but is untested. Report back any corner cases not handled.

Thanks a lot for the hints. As I complete beginner (this is my first gramamar in antlr4), I findt them very useful. But I don't know how to follow your suggestion about igoring whitespace. Certain rules in my grammar cannot skip whitespace because the latter bears meaning; some operators are either binary or unary depending on the whitespace surrounding them. — AmazingWouldBeGreatBut, Sep 22 '17 at 08:09
Anyway, I will try your hints in a few hours and see what it comes out after them; afterwards I will accept your answer. Thanks again! — AmazingWouldBeGreatBut, Sep 22 '17 at 08:13
I am working on it, trying to understand every step. Let me kind of ask your patience. I am truly grateful for your help and I will come back with every bit of result. — AmazingWouldBeGreatBut, Sep 22 '17 at 21:44
I've just added a new version of the grammar. Still producing lots of parsing errors. The problem is that I need blank spaces to be skipped by default, but not in certain rules (see above). Certain operators change behaviour depending on the blanks surrounding them. Unfortunately, my current grammar still does not work. @GRosenberg - feel free to comment at your please. — AmazingWouldBeGreatBut, Sep 23 '17 at 15:15
As a matter of fact, the problem I am facing now (per-rule space awareness, yet skip them by default) was already answered to me [here](https://stackoverflow.com/questions/45504124/ambiguous-call-expression-in-antlr4-grammar/45511037#45511037). I will go and try it now. — AmazingWouldBeGreatBut, Sep 23 '17 at 15:21
What "operators change behaviour depending on the blanks surrounding them"? Almost sounds like you are trying to solve a semantic issue in the parser. Use the parser for syntactic disambiguation. Use a tree walker to resolve semantic issues. — GRosenberg, Sep 23 '17 at 18:21
It's a weird trait of the language I am trying to define - if an operator is surrounded by spaces on both sides or by absolutly no spaces on any side between two identifiers, then it must be treated as a binary operator; otherwise, as an unary operator affecting the identifier closer to it. In the question I linked two comments above they advise in fact to manage it not even at the parser level, but at the lexer level. — AmazingWouldBeGreatBut, Sep 23 '17 at 18:50
And I think I've just done it; I've updated my question with the current grammar. Thanks for your help! Folllowing your advice for using recursion did simplify things. I tried not to use left recursion because other parsing engines do not cope with it. And I am not about the impact of it in the generated code by antlr4. But it will do for now. — AmazingWouldBeGreatBut, Sep 23 '17 at 18:52

Why is antlr4 parsing my sentence into two statements?

1 Answers1

Linked