Advice on handling an ambiguous operator in an ANTLR 4 grammar

Question

I am writing an ANTLR grammar file for a dialect of BASIC. Most of it is either working or I have a good idea of what I need to do next. However, I am not at all sure what I should do with the '=' character, which is used for both equality tests and assignment.

For example, this is a valid statement

t = (x = 5) And (y = 3)

This evaluates whether x is EQUAL to 5 and whether y is EQUAL to 3, then performs a logical AND on those results and ASSIGNS the result to t.

My grammar will parse this, albeit incorrectly, but I think that will sort itself out once the ambiguity is resolved.

[image: parse tree example]

How do I differentiate between the two uses of the '=' character?
1) Should I remove the assignment rule from expression and handle these cases (assignment vs. equality test) in my visitor and/or listener implementation during code generation?

2) Is there a better way to define the grammar so that this is already sorted out?

Would someone be able to simply point me in the right direction as to how best implement this language "feature"?

Also, I have been reading through The Definitive ANTLR 4 Reference as well as Language Implementation Patterns looking for a solution to this. It may be there, but I have not yet found it.

Below is the full parser grammar. The ASSIGN token is currently set to '='. EQUAL is set to '=='.

parser grammar wlParser;

options { tokenVocab=wlLexer; }

program
    :   multistatement (NEWLINE multistatement)* NEWLINE?
    ;

multistatement
    :   statement (COLON statement)*
    ;

statement
    :   declarationStat
    |   defTypeStat
    |   assignment
    |   expression
    ;

assignment
    :   lvalue op=ASSIGN expression
    ;

expression
    :   <assoc=right> left=expression op=CARAT right=expression #exponentiationExprStat
    |   (PLUS|MINUS) expression #signExprStat
    |   IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN #arrayIndexExprStat
    |   left=expression op=(ASTERISK|FSLASH) right=expression #multDivExprStat
    |   left=expression op=BSLASH right=expression #integerDivExprStat
    |   left=expression op=KW_MOD right=expression #modulusDivExprStat
    |   left=expression op=(PLUS|MINUS) right=expression #addSubExprStat
    |   left=string op=AMPERSAND right=string #stringConcatenation
    |   left=expression op=(RELATIONALOPERATORS | KW_IS | KW_ISA) right=expression #relationalComparisonExprStat
    |   left=expression (op=LOGICALOPERATORS right=expression)+ #logicalOrAndExprStat
    |   op=KW_LIKE patternString #likeExprStat
    |   LPAREN expression RPAREN #groupingExprStat
    |   NUMBER #atom
    |   string #atom
    |   IDENTIFIER DATATYPESUFFIX? #atom
    ;

lvalue
    :   (IDENTIFIER DATATYPESUFFIX?) | (IDENTIFIER DATATYPESUFFIX? LPAREN expression RPAREN)
    ;

string
    :   STRING
    ;

patternString
    :   DQUOT (QUESTIONMARK | POUND | ASTERISK | LBRACKET BANG? .*? RBRACKET)+ DQUOT
    ;

referenceType
    :   DATATYPE
    ;

declarationStat
    :   constDecl
    |   varDecl
    ;

constDecl
    :   CONSTDECL? KW_CONST IDENTIFIER EQUAL expression
    ;

varDecl
    :   VARDECL (varDeclPart (COMMA varDeclPart)*)? | listDeclPart
    ;

varDeclPart
    :   IDENTIFIER DATATYPESUFFIX? ((arrayBounds)? KW_AS DATATYPE (COMMA DATATYPE)*)?
    ;

listDeclPart
    :   IDENTIFIER DATATYPESUFFIX? KW_LIST KW_AS DATATYPE
    ;

arrayBounds
    :   LPAREN (arrayDimension (COMMA arrayDimension)*)? RPAREN
    ;

arrayDimension
    :   INTEGER (KW_TO INTEGER)?
    ;

defTypeStat
    :   DEFTYPES DEFTYPERANGE (COMMA DEFTYPERANGE)*
    ;

This is the lexer grammar.

lexer grammar wlLexer;

NUMBER
    :   INTEGER
    |   REAL
    |   BINARY
    |   OCTAL
    |   HEXIDECIMAL
    ;

RELATIONALOPERATORS
    :   EQUAL
    |   NEQUAL
    |   LT
    |   LTE
    |   GT
    |   GTE
    ;

LOGICALOPERATORS
    :   KW_OR
    |   KW_XOR
    |   KW_AND
    |   KW_NOT
    |   KW_IMP
    |   KW_EQV
    ;

INSTANCEOF
    :   KW_IS
    |   KW_ISA
    ;

CONSTDECL
    :   KW_PUBLIC
    |   KW_PRIVATE
    ;

DATATYPE
    :   KW_BOOLEAN
    |   KW_BYTE
    |   KW_INTEGER
    |   KW_LONG
    |   KW_SINGLE
    |   KW_DOUBLE
    |   KW_CURRENCY
    |   KW_STRING
    ;

VARDECL
    :   KW_DIM
    |   KW_STATIC
    |   KW_PUBLIC
    |   KW_PRIVATE
    ;

LABEL
    :   IDENTIFIER COLON
    ;

DEFTYPERANGE
    :   [a-zA-Z] MINUS [a-zA-Z]
    ;

DEFTYPES
    :   KW_DEFBOOL
    |   KW_DEFBYTE
    |   KW_DEFCUR
    |   KW_DEFDBL
    |   KW_DEFINT
    |   KW_DEFLNG
    |   KW_DEFSNG
    |   KW_DEFSTR
    |   KW_DEFVAR
    ;

DATATYPESUFFIX
    :   PERCENT
    |   AMPERSAND
    |   BANG
    |   POUND
    |   AT
    |   DOLLARSIGN
    ;

STRING
    :   (DQUOT (DQUOTESC|.)*? DQUOT)
    |   (LBRACE (RBRACEESC|.)*? RBRACE)
    |   (PIPE (PIPESC|.|NEWLINE)*? PIPE)
    ;

fragment DQUOTESC:          '\"\"' ;
fragment RBRACEESC:         '}}' ;
fragment PIPESC:            '||' ;

INTEGER
    :   DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
    ;

REAL
    :   DIGIT+ PERIOD DIGIT+ (E (PLUS|MINUS)? DIGIT+)?
    ;

BINARY
    :   AMPERSAND B BINARYDIGIT+
    ;

OCTAL
    :   AMPERSAND O OCTALDIGIT+
    ;

HEXIDECIMAL
    :   AMPERSAND H HEXDIGIT+
    ;

QUESTIONMARK:               '?' ;
COLON:                      ':' ;
ASSIGN:                     '=';
SEMICOLON:                  ';' ;
AT:                         '@' ;
LPAREN:                     '(' ;
RPAREN:                     ')' ;
DQUOT:                      '"' ;
LBRACE:                     '{' ;
RBRACE:                     '}' ;
LBRACKET:                   '[' ;
RBRACKET:                   ']' ;
CARAT:                      '^' ;
PLUS:                       '+' ;
MINUS:                      '-' ;
ASTERISK:                   '*' ;
FSLASH:                     '/' ;
BSLASH:                     '\\' ;
AMPERSAND:                  '&' ;
BANG:                       '!' ;
POUND:                      '#' ;
DOLLARSIGN:                 '$' ;
PERCENT:                    '%' ;
COMMA:                      ',' ;
APOSTROPHE:                 '\'' ;
TWOPERIODS:                 '..' ;
PERIOD:                     '.' ;
UNDERSCORE:                 '_' ;
PIPE:                       '|' ;
NEWLINE:                    '\r\n' | '\r' | '\n';
EQUAL:                      '==' ;
NEQUAL:                     '<>' | '><' ;
LT:                         '<' ;
LTE:                        '<=' | '=<';
GT:                         '>' ;
GTE:                        '>=' | '=>' ;

KW_AND:                     A N D ;
KW_BINARY:                  B I N A R Y ;
KW_BOOLEAN:                 B O O L E A N ;
KW_BYTE:                    B Y T E ;
KW_DATATYPE:                D A T A T Y P E ;
KW_DATE:                    D A T E ;
KW_INTEGER:                 I N T E G E R ;
KW_IS:                      I S ;
KW_ISA:                     I S A ;
KW_LIKE:                    L I K E ;
KW_LONG:                    L O N G ;
KW_MOD:                     M O D ;
KW_NOT:                     N O T ;
KW_TO:                      T O ;
KW_FALSE:                   F A L S E ;
KW_TRUE:                    T R U E ;
KW_SINGLE:                  S I N G L E ;
KW_DOUBLE:                  D O U B L E ;
KW_CURRENCY:                C U R R E N C Y ;
KW_STRING:                  S T R I N G ;

fragment BINARYDIGIT:       ('0'|'1') ;
fragment OCTALDIGIT:        ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7') ;
fragment DIGIT:             '0'..'9' ;
fragment HEXDIGIT:          ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9' | A | B | C | D | E | F) ;
fragment A:                 ('a'|'A');
fragment B:                 ('b'|'B');
fragment C:                 ('c'|'C');
fragment D:                 ('d'|'D');
fragment E:                 ('e'|'E');
fragment F:                 ('f'|'F');
fragment G:                 ('g'|'G');
fragment H:                 ('h'|'H');
fragment I:                 ('i'|'I');
fragment J:                 ('j'|'J');
fragment K:                 ('k'|'K');
fragment L:                 ('l'|'L');
fragment M:                 ('m'|'M');
fragment N:                 ('n'|'N');
fragment O:                 ('o'|'O');
fragment P:                 ('p'|'P');
fragment Q:                 ('q'|'Q');
fragment R:                 ('r'|'R');
fragment S:                 ('s'|'S');
fragment T:                 ('t'|'T');
fragment U:                 ('u'|'U');
fragment V:                 ('v'|'V');
fragment W:                 ('w'|'W');
fragment X:                 ('x'|'X');
fragment Y:                 ('y'|'Y');
fragment Z:                 ('z'|'Z');

IDENTIFIER
    :   [a-zA-Z_][a-zA-Z0-9_~]*
    ;


LINE_ESCAPE
    :   (' ' | '\t') UNDERSCORE  ('\r'? | '\n')
    ;

WS
    :   [ \t] -> skip
    ;

I've added the full parser and lexer grammar. Note that currently assign is set to '=' and equality is set to '=='. Thanks for reviewing this. — Kelvin Johnson, Oct 13 '14 at 14:46
This is reeeeeeeally ugly and hard and bad and interesting :) So, if you could elaborate on how to differentiate assignment and equality: Is equality always written in parentheses? Does it always go with "if"? Is it safe to say that if "=" is part of a logical expression, then it's always equality? — cantSleepNow, Oct 13 '14 at 20:35
Did you try with `<assoc=right>` on `assignment` (which logically should be right-associative anyway)? — Lucas Trzesniewski, Oct 13 '14 at 20:41
Lucas: Yeah. I've tried assoc=right. Although, at that point I was just reaching for anything. — Kelvin Johnson, Oct 14 '14 at 01:01
can we assume that the first '=' on a line is an assignment operator? — hendryau, Oct 14 '14 at 01:07
cantSleepNow: Parentheses are not required. Being part of an if/do/while/for/etc. is not required. The sample statement above could be written t = x=5 and y=3 and still be valid (try this in Visual Basic), although the grouping would be incorrect since equality '=' and 'AND' have the same precedence. The only thing I can think of now is that the _first_ '=' sign has to be for assignment, and assignments cannot be nested within another statement except a looping construct such as 'for'. — Kelvin Johnson, Oct 14 '14 at 01:08
hendryau: This is what I am researching now. I am going through the docs and writing code to test this assumption: the first '=' on a line is the assignment. — Kelvin Johnson, Oct 14 '14 at 01:33
@Kelvin FYI, when talking to people you should precede their name with a `@` (just like I did) so they get notified (see [details](http://stackoverflow.com/editing-help#comment-formatting)). — Lucas Trzesniewski, Oct 14 '14 at 18:18
@Lucas Got it. I've read this site for some time now. Yet I'm still a noob! — Kelvin Johnson, Oct 14 '14 at 18:54

hendryau · Accepted Answer · 2014-10-14T18:40:15.057

Take a look at this grammar (Note that this grammar is not supposed to be a grammar for BASIC, it's just an example to show how to disambiguate using "=" for both assignment and equality):

grammar Foo;

program:
    (statement | exprOtherThanEquality)*
    ;

statement:
    assignment
    ;

expr:
    equality | exprOtherThanEquality
    ;

exprOtherThanEquality:
    boolAndOr
    ;

boolAndOr:
    atom (BOOL_OP expr)*
    ;

equality:
    atom EQUAL expr
    ;

assignment:
    VAR EQUAL expr ENDL
    ;

atom:
    BOOL | 
    VAR | 
    INT |
    group
    ;

group:
    LEFT_PARENTH expr RGHT_PARENTH
    ;

ENDL         : ';' ;
LEFT_PARENTH : '(' ;
RGHT_PARENTH : ')' ;
EQUAL        : '=' ;

BOOL:
    'true' | 'false'
    ;

BOOL_OP:
    'and' | 'or'
    ;

VAR:
    [A-Za-z_]+ [A-Za-z_0-9]*
    ;

INT:
    '-'? [0-9]+
    ;

WS:
    [ \t\r\n] -> skip
    ;

Here is the parse tree for the input: t = (x = 5) and (y = 2);

[image: parse tree for t = (x = 5) and (y = 2);]

In one of the comments above, I asked you if we can assume that the first equal sign on a line always corresponds to an assignment. I retract that assumption slightly... The first equal sign on a line always corresponds to an assignment unless it is contained within parentheses. With the above grammar, this is a valid line: (x = 2). Here is the parse tree:

[image: parse tree for (x = 2)]
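
As a rough sketch of how the same idea might carry over to a left-recursive expression rule like the one in the question (all rule and token names here are illustrative assumptions, not the asker's actual grammar): the lexer emits a single EQ token for '=', because it tokenizes without knowing the parser's context, and the parser decides what that token means by where it appears. When a statement could match either alternative, ANTLR 4 resolves the ambiguity in favor of the alternative listed first, so listing assignment first makes a leading "lvalue =" an assignment, while any later '=' is consumed inside expression as equality.

```antlr
grammar AssignSketch;

// Hypothetical, cut-down grammar for illustration only.
program
    :   statement (NEWLINE statement)* NEWLINE? EOF
    ;

statement
    :   assignment      // listed first, so a leading "lvalue =" is an assignment
    |   expression      // anything else, e.g. a parenthesized equality test
    ;

assignment
    :   lvalue EQ expression
    ;

lvalue
    :   IDENTIFIER
    ;

expression
    :   LPAREN expression RPAREN                              #groupExpr
    |   left=expression op=(EQ | NEQ) right=expression        #relationalExpr
    |   left=expression op=(KW_AND | KW_OR) right=expression  #logicalExpr
    |   NUMBER                                                #numberAtom
    |   IDENTIFIER                                            #identifierAtom
    ;

EQ         : '=' ;      // one token for both assignment and equality
NEQ        : '<>' ;
LPAREN     : '(' ;
RPAREN     : ')' ;
KW_AND     : [aA] [nN] [dD] ;   // keywords must precede IDENTIFIER
KW_OR      : [oO] [rR] ;
NUMBER     : [0-9]+ ;
IDENTIFIER : [a-zA-Z_] [a-zA-Z0-9_]* ;
NEWLINE    : '\r'? '\n' ;
WS         : [ \t]+ -> skip ;
```

With this sketch, t = (x = 5) And (y = 3) should parse as an assignment whose right-hand side is a logical AND of two equality tests, and (x = 2) on its own line should parse as an expression, since an lvalue cannot start with a parenthesis. Because the relational alternative is listed before the logical one, '=' binds tighter than And here; reorder the alternatives if the dialect's precedence differs.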

Did you write this grammar? Or did you find it at some website? — Kelvin Johnson, Oct 14 '14 at 23:15
