I am writing a small parser for expressions. For now I only want it to recognize binary multiplications (myId * myId) and C-like pointer dereferences (*myId), plus compound assignment statements (myId *= myId).
The input that makes the parser fail is:
x *= y;
... on which the parser fails with this message and parse tree:
[line 1:1 mismatched input ' *' expecting {';', NEWLINE}]
(sourceFile (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression x)))) * = ) (statement (expressionStatement (expression (monoOperatedExpression (atomicExpression y)))) ;) <EOF>)
I've been scratching my head for a while, but I can't see what is wrong with my grammar (shown below). Any hints, please? Thanks in advance.
grammar Sable;
options {}
@header {
package org.sable.parser;
}
ASSIGNMENT_OP:
'='
;
BINARY_OP:
'*'
;
WS_BUT_NOT_NEWLINE:
WhiteSpaceButNotNewLineCharacter
;
NEWLINE:
('\u000D' '\u000A')
| '\u000A'
;
WSA_BINARY_OP:
(WS_BUT_NOT_NEWLINE+ BINARY_OP WS_BUT_NOT_NEWLINE+)
| BINARY_OP
;
WSA_PREFIX_OP:
(WS_BUT_NOT_NEWLINE+ '*' )
;
WS : WhiteSpaceCharacter+ -> skip
;
IDENTIFIER:
(IdentifierHead IdentifierCharacter*)
| ('`'(IdentifierHead IdentifierCharacter*)'`')
;
// NOTE: a file with zero statements is allowed because
// it can contain just comments.
sourceFile:
statement* EOF;
statement:
expressionStatement (';' | NEWLINE);
// Requires that no valid expression starts with
// an equals sign or any other assignment operator.
expressionStatement:
expression (assignmentOperator expression)?;
expression:
monoOperatedExpression (binaryOperator monoOperatedExpression)?
;
monoOperatedExpression:
atomicExpression
;
binaryOperator:
WSA_BINARY_OP
;
atomicExpression:
IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
;
type:
IDENTIFIER
;
assignmentOperator:
ASSIGNMENT_OP
;
fragment DecimalDigit:
'0'..'9'
;
fragment IdentifierHead:
'a'..'z'
| 'A'..'Z'
;
fragment IdentifierCharacter:
DecimalDigit
| IdentifierHead
;
fragment WhiteSpaceCharacter:
WhiteSpaceButNotNewLineCharacter
| NewLineCharacter;
fragment WhiteSpaceButNotNewLineCharacter:
[\u0020\u0009\u000B\u000C]
;
fragment NewLineCharacter:
[\u000A\u000D]
;
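For what it's worth, the error message quotes the offending token as ' *', leading space included, so the lexer seems to be gluing the space to the asterisk. Below is a rough Python sketch (not ANTLR itself) of how I understand the ANTLR lexer to pick tokens: the longest match wins, and ties go to the rule defined first. The regexes are my own hand translation of the lexer rules above, the whitespace set is the one I intended rather than what I literally typed, and `RULES`, `SKIPPED` and `tokenize` are names made up for this sketch.

```python
import re

# Intended non-newline whitespace: space, tab, vertical tab, form feed.
WS_NOT_NL = r"[ \t\x0b\x0c]"

# Lexer rules of the grammar above, in definition order. The literals used in
# parser rules come first, assuming implicit tokens precede the named rules.
RULES = [
    ("';'", r";"),
    ("'<'", r"<"),
    ("','", r","),
    ("'>'", r">"),
    ("ASSIGNMENT_OP", r"="),
    ("BINARY_OP", r"\*"),
    ("WS_BUT_NOT_NEWLINE", WS_NOT_NL),
    ("NEWLINE", r"\r\n|\n"),
    ("WSA_BINARY_OP", WS_NOT_NL + r"+\*" + WS_NOT_NL + r"+|\*"),
    ("WSA_PREFIX_OP", WS_NOT_NL + r"+\*"),
    ("WS", r"[ \t\x0b\x0c\r\n]+"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*|`[A-Za-z][A-Za-z0-9]*`"),
]
SKIPPED = {"WS"}  # rules marked "-> skip"


def tokenize(text):
    """Split text into (rule name, matched text) pairs using longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earliest rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("x *= y;"))
print(tokenize("x * y"))
```

If this sketch is faithful, the second token of "x *= y;" comes out as WSA_PREFIX_OP with text ' *', followed by a separate ASSIGNMENT_OP for '=', which matches the token quoted in the error message. For "x * y" the middle token is a single WSA_BINARY_OP covering ' * '.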
EDIT: adding a new version of the grammar at the request of commenters.
grammar Sable;
options {}
@header {
package org.sable.parser;
}
//
// PARSER RULES.
sourceFile : statement* EOF;
statement : expressionStatement (SEMICOLON | NEWLINE);
expressionStatement : expression (ASSIGNMENT_OPERATOR expression)?;
expression:
expression WSA_OPERATOR expression
| expression OPERATOR expression
| OPERATOR expression
| expression OPERATOR
| atomicExpression
;
atomicExpression:
IDENTIFIER ('<' type (',' type)* '>')? //TODO: can this be a lsv?
;
type : IDENTIFIER;
//
// LEXER RULES.
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~[\u000A\u000D]* -> channel(HIDDEN);
ASSIGNMENT_OPERATOR : Operator? '=';
// WSA = White Space Aware token.
// These are tokens that occur in a given whitespace context.
WSA_OPERATOR:
(WhiteSpaceNotNewline+ Operator WhiteSpaceNotNewline+)
;
OPERATOR : Operator;
// Newline chars are defined apart because they carry meaning as a statement
// delimiter.
NEWLINE:
('\u000D' '\u000A')
| '\u000A'
;
WS : WhiteSpaceNotNewline -> skip;
SEMICOLON : ';';
IDENTIFIER:
(IdentifierHead IdentifierCharacter*)
| ('`'(IdentifierHead IdentifierCharacter*)'`')
;
fragment DecimalDigit :'0'..'9';
fragment IdentifierHead:
'a'..'z'
| 'A'..'Z'
| '_'
| '\u00A8'
| '\u00AA'
| '\u00AD'
| '\u00AF' |
'\u00B2'..'\u00B5' |
'\u00B7'..'\u00BA' |
'\u00BC'..'\u00BE' |
'\u00C0'..'\u00D6' |
'\u00D8'..'\u00F6' |
'\u00F8'..'\u00FF' |
'\u0100'..'\u02FF' |
'\u0370'..'\u167F' |
'\u1681'..'\u180D' |
'\u180F'..'\u1DBF' |
'\u1E00'..'\u1FFF' |
'\u200B'..'\u200D' |
'\u202A'..'\u202E' |
'\u203F'..'\u2040' |
'\u2054' |
'\u2060'..'\u206F' |
'\u2070'..'\u20CF' |
'\u2100'..'\u218F' |
'\u2460'..'\u24FF' |
'\u2776'..'\u2793' |
'\u2C00'..'\u2DFF' |
'\u2E80'..'\u2FFF' |
'\u3004'..'\u3007' |
'\u3021'..'\u302F' |
'\u3031'..'\u303F' |
'\u3040'..'\uD7FF' |
'\uF900'..'\uFD3D' |
'\uFD40'..'\uFDCF' |
'\uFDF0'..'\uFE1F' |
'\uFE30'..'\uFE44' |
'\uFE47'..'\uFFFD'
;
fragment IdentifierCharacter:
DecimalDigit
| '\u0300'..'\u036F'
| '\u1DC0'..'\u1DFF'
| '\u20D0'..'\u20FF'
| '\uFE20'..'\uFE2F'
| IdentifierHead
;
// Non-newline whitespace is defined apart because it carries meaning in
// certain contexts, e.g. within space-aware operators.
fragment WhiteSpaceNotNewline : [\u0020\u0009\u000B\u000C];
fragment Operator:
'*'
| '/'
| '%'
| '+'
| '-'
| '<<'
| '>>'
| '&'
| '^'
| '|'
;