What is the order of invocation for parser and lexer rules in an ANTLR grammar? For example, in the following grammar an input of

223

is always identified as `APLHANUMERIC`, not `digit`.

digit  : F_DIGIT+;
alpha  : APLHANUMERIC;
APLHANUMERIC   : (LOWERCASE | UPPERCASE | F_DIGIT | '_')+ ;
fragment LOWERCASE  : [a-z] ;
fragment UPPERCASE  : [A-Z] ;
fragment F_DIGIT   : [0-9] ;
sepp2k
KVM
  • Lexer rules always take precedence, because they work on the raw input text and generate the tokens that parser rules consume (see https://en.wikipedia.org/wiki/Lexical_analysis). If you want to change the token generated, you might convert `digit` to a lexer rule. – Lex Li Sep 13 '21 at 16:02
  • Great, thanks for the suggestion. Between fragments and a normal lexer rule, which one takes precedence? – KVM Sep 13 '21 at 16:07
  • Your fragments do not take precedence at all, because they are not considered lexer rules. That's exactly what "fragment" means: https://stackoverflow.com/questions/6487593/what-does-fragment-mean-in-antlr – Lex Li Sep 13 '21 at 16:44
  • The lexer tokenizes the entire input the moment the parser calls the lexer for the first token. The lexer is also not influenced by the parser, no matter which rule you invoke. For your grammar, `digit : F_DIGIT+;` is a parser rule (the LHS symbol begins with a lowercase letter). The only token the lexer can return comes from `APLHANUMERIC : (LOWERCASE | UPPERCASE | F_DIGIT | '_')+ ;` (lexer rules are those whose LHS symbol begins with an uppercase letter). The other lexer rules are fragments, so the lexer can never create a token from them. Check the warnings from the Tool. – kaby76 Sep 13 '21 at 17:18

1 Answer

To elaborate a bit on the comments:

The Tokenizer (AKA Lexer) will always process your input stream, producing a stream of tokens for the parser rules to use when recognizing the structure of your source.

The only "order of invocation" is that the Tokenizer runs before the parser (an obvious necessity, since the parser acts on the tokens produced by the lexer).
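To picture that ordering, here is a toy Python sketch. It is not ANTLR code: `lex` and `produced` are hypothetical names, and it assumes the parser requests tokens on demand rather than receiving them all at once. Each token has to be lexed before the consumer can see it, and asking for more tokens is what drives the lexer forward.

```python
import re

produced = []  # records each token at the moment the toy lexer emits it

def lex(text):
    """Hypothetical lexer: a generator that yields one token per request."""
    for m in re.finditer(r"[a-zA-Z0-9_]+", text):
        produced.append(m.group())
        yield m.group()

tokens = lex("223 abc x_1")
first = next(tokens)           # the "parser" asks for a single token
after_first = list(produced)   # snapshot: only that one token has been lexed
rest = list(tokens)            # asking for more drives the lexer further

print(first, after_first, rest)
```

The snapshot taken after the first request contains a single token, which is the point of the sketch: nothing reaches the consumer that the lexer has not already produced.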

For lexer rules, all rules are logically applied against your input stream. If more than one lexer rule can match the next characters in your input, two rules come into play:

1 - If one lexer rule matches a longer sequence of characters than the others, it is used to produce the token.

2 - If more than one rule matches the same number of characters in your input stream, the first of those rules to appear in your grammar "wins".
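The two rules above can be mimicked with a small Python sketch. This is only an illustration of longest-match-then-first-rule selection, not ANTLR's implementation. The `DIGITS` rule is a hypothetical addition (the question's grammar has no such lexer rule), and the fragments are inlined into the regular expressions because fragments never produce tokens themselves.

```python
import re

# Stand-ins for lexer rules, listed in grammar order.
RULES_ORIGINAL = [
    ("APLHANUMERIC", re.compile(r"[a-zA-Z0-9_]+")),
]
RULES_WITH_DIGITS = [
    ("DIGITS", re.compile(r"[0-9]+")),              # hypothetical rule, listed first
    ("APLHANUMERIC", re.compile(r"[a-zA-Z0-9_]+")),
]

def next_token(rules, text, pos=0):
    """Return (rule_name, lexeme) for the winning rule at pos, or None."""
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m is None or m.end() == pos:
            continue  # rule does not match here (or matches nothing)
        length = m.end() - pos
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if best is None or length > best[2]:
            best = (name, m.group(), length)
    return None if best is None else best[:2]

print(next_token(RULES_ORIGINAL, "223"))      # ('APLHANUMERIC', '223')
print(next_token(RULES_WITH_DIGITS, "223"))   # ('DIGITS', '223')
print(next_token(RULES_WITH_DIGITS, "22a"))   # ('APLHANUMERIC', '22a')
```

With only `APLHANUMERIC` available, `223` can only become that token. Once a digits-only lexer rule is listed first, both rules match three characters and the earlier one wins. For `22a` the longer match still goes to `APLHANUMERIC`.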

Fragments are not lexer rules in their own right, so they never produce tokens. They are just a convenience you can use to compose lexer rules, avoiding repetition and aiding readability.

In the parser, you choose the starting rule, and then the parser processes the contents of that rule (recursively calling the rules that make up that rule and its children, etc.). The only "order" involved there is that ANTLR will evaluate top-level alternatives in a rule in order, and this can be used to address things like proper operator precedence in arithmetic expressions.

Mike Cargal
  • Well explained! But let me clarify what it means when people say that the lexer runs before the parser. Actually, the parser is the driving force here. It asks for the next tokens as it needs them. Most of the time only a single token of lookahead is required (the next token), and the lexer will only scan the input for this single token. However, if the parser needs more lookahead (remember that the ALL(*) algorithm can use unlimited lookahead, if necessary), then more tokens are retrieved from the lexer. It's not as if the lexer always consumes all the input before the parser even takes its first step! – Mike Lischke Sep 14 '21 at 06:55