ParserRule matching the wrong token

Question

I'm trying to learn a bit of ANTLR4 by defining a grammar for a 4GL language.

This is what I've got:

compileUnit
:
    typedeclaration EOF
;

typedeclaration
:
    ID LPAREN DATATYPE INT RPAREN
;

DATATYPE
:
    DATATYPE_ALPHANUMERIC
    | DATATYPE_NUMERIC
;

DATATYPE_ALPHANUMERIC
:
    'A'
;

DATATYPE_NUMERIC
:
    'N'
;

fragment
DIGIT
:
    [0-9]
;

fragment
LETTER
:
    [a-zA-Z]
;

INT
:
    DIGIT+
;

ID
:
    LETTER
    (
        LETTER
        | DIGIT
    )*
;

LPAREN
:
    '('
;

RPAREN
:
    ')'
;

WS
:
    [ \t\f]+ -> skip
;

What I want to be able to parse:

TEST (A10)

what I get:

typedeclaration:1:6: mismatched input 'A10' expecting DATATYPE

I am however able to write:

TEST (A 10)

Why do I need to put whitespace in here? LPAREN DATATYPE works on its own, so no space is needed between them. INT RPAREN works as well. So why is a space needed between DATATYPE and INT? I'm a bit confused by that one. I guess the lexer is matching ID because it's the "longest" match, but there must be some way to force it to be lazier here, right?

Because `A10` is a valid token for ID. Do you want to match the first variant or both? ANTLR always matches tokens as long as possible (and tokens are created before parsing). — CoronA, Aug 13 '16 at 09:52
@CoronA I want to match the first variant, without the space in between. — Markus A., Aug 13 '16 at 10:59
I can think of the following solutions: include whitespace explicitly (instead of skipping it), make IDs not begin with `A` or `N`, or accept any ID within the parentheses and filter out wrong ones in semantic analysis. Are there preferences? — CoronA, Aug 13 '16 at 11:27

score 1 · Accepted Answer · edited Sep 17 '17 at 19:26

You should exclude the characters 'A' and 'N' from the first position of ID. As @CoronA noted, ANTLR matches tokens as long as possible (the ID match 'A10' is longer than the DATATYPE_ALPHANUMERIC match 'A'). Also read this: Priority rules. Try the following grammar:

grammar expr;

compileUnit
    : typedeclaration EOF
    ;

typedeclaration
    : ID LPAREN datatype INT RPAREN
    ;

datatype
    : DATATYPE_ALPHANUMERIC
    | DATATYPE_NUMERIC
    ;

DATATYPE_ALPHANUMERIC
    : 'A'
    ;

DATATYPE_NUMERIC
    : 'N'
    ;

INT
    : DIGIT+
    ;

ID
    : [b-mo-zB-MO-Z] (LETTER | DIGIT)*
    ;

LPAREN
    : '('
    ;

RPAREN
    : ')'
    ;

WS
    : [ \t\f]+ -> skip
    ;

fragment
DIGIT
    : [0-9]
    ;

fragment
LETTER
    : [a-zA-Z]
    ;
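
To see why restricting the first character of ID changes the result, the longest-match behaviour can be imitated in a few lines of Python. This is only a rough sketch built on the standard `re` module, not ANTLR itself; the helper names `make_rules` and `tokenize` are made up for illustration, and the rule order simply copies the grammar above. It assumes two things: that the longest match wins, and that on a tie the rule listed first wins.

```python
import re

def make_rules(id_pattern):
    # Rule order mirrors the grammar above; earlier rules win ties.
    return [
        ("DATATYPE_ALPHANUMERIC", r"A"),
        ("DATATYPE_NUMERIC", r"N"),
        ("INT", r"[0-9]+"),
        ("ID", id_pattern),
        ("LPAREN", r"\("),
        ("RPAREN", r"\)"),
        ("WS", r"[ \t\f]+"),
    ]

def tokenize(text, rules):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on equal length the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

UNRESTRICTED_ID = r"[a-zA-Z][a-zA-Z0-9]*"      # ID as in the question
RESTRICTED_ID = r"[b-mo-zB-MO-Z][a-zA-Z0-9]*"  # ID as in the grammar above

original = tokenize("TEST (A10)", make_rules(UNRESTRICTED_ID))
restricted = tokenize("TEST (A10)", make_rules(RESTRICTED_ID))
print(original)
print(restricted)
```

With the unrestricted pattern, `A10` comes out as one ID token, which is what the parser complains about. With the restricted pattern, ID cannot start at `A`, so the input splits into a datatype token followed by an INT token.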

Alternatively, you can use the following grammar, which does not restrict ids. Here the data type tokens are recognized in preference to LETTER because their rules are listed first. It is not very clean either:

grammar expr;

compileUnit
    : typedeclaration EOF
    ;

typedeclaration
    : id LPAREN datatype DIGIT+ RPAREN
    ;

id
    : (datatype | LETTER) (datatype | LETTER | DIGIT)*
    ;

datatype
    : DATATYPE_ALPHANUMERIC
    | DATATYPE_NUMERIC
    ;

DATATYPE_ALPHANUMERIC: 'A';
DATATYPE_NUMERIC:      'N';
// Add the other data types here.
LETTER:                [a-zA-Z];

LPAREN
    : '('
    ;

RPAREN
    : ')'
    ;

WS
    : [ \t\f]+ -> skip
    ;

DIGIT
    : [0-9]
    ;

Thanks for your example @KvanTTT, I kinda see where this is going. But now my identifiers can't start with A or N, which is not a desired behaviour. There are 11 datatypes which are just represented by a single letter, I omitted all but 2 for readability — Markus A., Aug 14 '16 at 15:01
I could let the typedeclaration expect an ID instead of datatype and extract it within the listener/visitor, but wouldn't that be a "grammar smell"? :-) — Markus A., Aug 14 '16 at 15:08
@MarkusA. I removed the id restriction. If this grammar does not suit you, then the listener/visitor approach should be used. — Ivan Kochurkin, Aug 14 '16 at 15:41
I replaced the SO documentation link in your answer, as docs is being shut down. It points to the same content, but I've expanded it a bit. — Lucas Trzesniewski, Sep 17 '17 at 19:29
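
The listener/visitor approach mentioned in the comments could look roughly like the sketch below. It assumes the grammar is changed so that the text inside the parentheses arrives as a single ID token such as `A10`, which the listener or visitor then splits and validates. The names `DATATYPES` and `split_type_spec` are hypothetical. Only `A` and `N` come from the question; the other nine single-letter types are not listed there, so they are left out.

```python
import re

# Only 'A' and 'N' appear in the question; further datatypes would go here.
DATATYPES = {"A": "alphanumeric", "N": "numeric"}

# One letter followed by one or more digits, e.g. "A10".
_TYPE_SPEC = re.compile(r"([A-Za-z])([0-9]+)")

def split_type_spec(token_text):
    """Split text like 'A10' into (datatype name, length), or raise ValueError."""
    m = _TYPE_SPEC.fullmatch(token_text)
    if m is None or m.group(1) not in DATATYPES:
        raise ValueError(f"invalid type specification: {token_text!r}")
    return DATATYPES[m.group(1)], int(m.group(2))

print(split_type_spec("A10"))
```

A listener or visitor would call such a helper with the token text and report the `ValueError` as a semantic error. This keeps identifiers unrestricted, at the cost of moving the datatype check out of the grammar.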
