
We all know variable names cannot start with a number (e.g., foo1 is valid, 1foo is not).

I am trying to write a grammar file that allows only valid variable names, each of which must be followed by a colon. (This is part of a much larger grammar; I'm just stuck on this one part.)

It seems like it should be simple. I define a rule id that accepts only an alpha character as its first character, followed by any number of alphanumeric characters. However, what seems like a simple task is failing for me. Can anyone explain why?

Here is my grammar:

grammar validName;

var_declaration :VAR id COLON;
VAR: 'var';
COLON: ':';
DIGIT: [0-9];
ALPHA: [a-zA-Z_];
ALPHANUM: ALPHA | DIGIT;

id: ALPHA ALPHANUM*;

WS: [ \n\t\r]+ -> skip;

Here is my test input:

var myId : 

And here is the error:

line 1:5 mismatched input 'y' expecting ':'

Why is ALPHANUM* not matching anything??

john k
  • The answer at https://stackoverflow.com/questions/21467473/how-lexer-lookahead-works-with-greedy-and-non-greedy-matching-in-antlr3-and-antl may help you. It suggests you want to accept SINGLE | MULTI, where SINGLE is just ALPHA and MULTI is ALPHA ALPHANUM+ – J_H Feb 02 '18 at 23:35
  • Close but not quite. His question seems to be about how greedy operators work, and his problem is slightly different. I need the first character to be ALPHA followed by any ALPHANUM. There is no 'or' in there. I do not want to choose between ALPHA and ALPHA ALPHANUM. It's ALPHA ALPHANUM*, and that's the only choice. And I do not use the + operator, I need to use the *. I will follow the links there though. At least you've given me a place to start. – john k Feb 03 '18 at 00:04

1 Answer


In ANTLR, the lexer runs to completion before the parser runs. Parser rules have no influence over how the lexer behaves.

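So, given the text myId, the lexer is going to emit four ALPHA tokens. This is because the ALPHA rule occurs first and the match length for both the ALPHA and ALPHANUM rules is the same (one character). In fact, ALPHANUM will never be emitted at all, since ALPHA and DIGIT are both listed before it and match the same single characters. After the first ALPHA, the parser rule id cannot continue with the second ALPHA token, which is the error reported at 'y'.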
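To make the tie-break concrete, here is a rough Python sketch that imitates only the two rules described above: the longest match wins, and on equal length the rule declared first wins. It is not ANTLR itself. The `tokenize` helper and the regex translations of the rules are my own assumptions for illustration, with ALPHANUM modelled as "ALPHA or DIGIT".

```python
import re

# Lexer rules in the order the question's grammar declares them.
# ALPHANUM is modelled as ALPHA | DIGIT (an assumption of this sketch).
RULES = [
    ("VAR", r"var"),
    ("COLON", r":"),
    ("DIGIT", r"[0-9]"),
    ("ALPHA", r"[a-zA-Z_]"),
    ("ALPHANUM", r"[a-zA-Z_]|[0-9]"),
]
WS = re.compile(r"[ \n\t\r]+")


def tokenize(text, rules=RULES):
    """Hypothetical helper: longest match wins; on a tie, the earlier rule wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:  # mirrors "WS -> skip"
            pos = ws.end()
            continue
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("var myId :"))
```

Under these assumptions the identifier comes out as four separate ALPHA tokens and ALPHANUM never appears, so the parser receives ALPHA 'y' where id allows only ALPHANUM tokens after the first ALPHA.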

Try:

var_declaration :VAR ID COLON;

VAR: 'var';
ID: ALPHA ( ALPHA | DIGIT )*;

COLON: ':';
DIGIT: [0-9];
ALPHA: [a-zA-Z_];
WS: [ \n\t\r]+ -> skip;
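As a sanity check on the rule order in this grammar, the following Python sketch imitates just the "longest match, then first declared rule" behaviour. It is not ANTLR; the `tokenize` helper and the regex versions of the rules are assumptions made for illustration.

```python
import re

# Lexer rules in the order the suggested grammar declares them.
RULES = [
    ("VAR", r"var"),
    ("ID", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("COLON", r":"),
    ("DIGIT", r"[0-9]"),
    ("ALPHA", r"[a-zA-Z_]"),
]
WS = re.compile(r"[ \n\t\r]+")


def tokenize(text, rules=RULES):
    """Hypothetical helper: longest match wins; on a tie, the earlier rule wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        ws = WS.match(text, pos)
        if ws:  # mirrors "WS -> skip"
            pos = ws.end()
            continue
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for sample in ["var myId :", "variable", "x", "1foo"]:
    print(sample, "->", tokenize(sample))
```

In this sketch, myId becomes a single ID token. "var" stays VAR because VAR is declared before ID and both match three characters, while "variable" becomes ID because that match is longer. Note that 1foo is still tokenized (as DIGIT followed by ID); it is the parser rule var_declaration that rejects it, since it requires an ID right after VAR.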
GRosenberg
  • This worked! Now you've got to explain to me why ID: ALPHA ( ALPHA | DIGIT )* is not the same as ALPHANUM: ALPHA | DIGIT; ID: ALPHA ALPHANUM*; – john k Feb 03 '18 at 01:03
  • And what do you mean by ALPHANUM will never match since DIGIT is listed earlier? ALPHANUM can be either DIGIT or ALPHA. Doesn't '|' mean 'or'? If the lexer doesn't find a match at first, doesn't it then look at the other side of the pipe? – john k Feb 03 '18 at 01:09
  • Both `ALPHA` and `DIGIT` occur before `ALPHANUM`, so only `ALPHA` and `DIGIT` tokens will ever be emitted by the lexer. While your understanding of the alt operator is correct, the lexer will always choose the first matching rule of those that have the same match length (all three rules have a match length of one). – GRosenberg Feb 03 '18 at 01:27
  • Ouch! Good explanation. It seems that debug output which reveals the lexer output would aid folks trying to understand whether their grammar is correct. – J_H Feb 03 '18 at 18:18