How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

Question

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.

I'm trying to write antlr3 grammar that allows me to match input such as:

"identifierkeyword"

I came up with a grammar like so in Antlr 3.4:

KEYWORD: 'keyword' ;

IDENTIFIER
: 
  (options {greedy=false;}: (LOWCHAR|HIGHCHAR))+ 
;

/** lowercase letters */
fragment LOWCHAR
:   'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
:   'A'..'Z';

parse: IDENTIFIER KEYWORD EOF;

however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)

Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.

After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.

Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:

KEYWORD: 'keyword' ;

/** lowercase letters */
fragment LOWCHAR
:   'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
:   'A'..'Z';

IDENTIFIER
: 
  (LOWCHAR|HIGHCHAR)+?
;

parse: IDENTIFIER KEYWORD EOF;

for the input: "identifierkeyword" it produces this error: line 1:1 mismatched input 'd' expecting 'keyword'

it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.

Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?

I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"

I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?

After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:

parse
:   
  identifier 'keyword'
;

identifier
:   
  (HIGHCHAR | LOWCHAR)+
;

/** lowercase letters */
LOWCHAR
:   'a'..'z';
/** uppercase letters */
HIGHCHAR
:   'A'..'Z';

This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments. A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?

Sam Harwell · Accepted Answer · 2014-01-30T22:25:22.727

The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.

identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;

This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.

The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.

Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

Ah, so thats what non-greedy matching is about, it matches the smallest token possible. What i thought was that it matches the longest possible IDENTIFIER token while looking for conflicting lexer rules in the lookahead and then decides by precedence. Thank you for your answer, if you don't mind i'd leave the question open since i still have parts i don't understand like why i get this error: "(The following alternatives can never be matched: 1)" with my first antlr3 grammar. — Attila Horváth, Jan 30 '14 at 22:16
Thats just too awesome, full credit goes to you then :) Thank you very much for clearing out all the clouds around my head. Actually i've never known i could be this happy about getting to learn something new — Attila Horváth, Jan 30 '14 at 22:52
May i humbly ask you if you would help me in this matter too? (This question has been around a lot and i've also added a bounty on it): http://stackoverflow.com/questions/20938550/parse-stacked-comparison-expression-into-logical-conjunction-tree-with-antlr3 — Attila Horváth, Jan 30 '14 at 23:00

How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

1 Answers1

Linked