I have the following grammar:

rule: 'aaa' | 'a' 'a';

It can successfully parse the string 'aaa', but it fails to parse 'aa' with the following error:

line 1:2 mismatched character '<EOF>' expecting 'a'

FYI, this is a lexer problem, not a parser problem: I don't even call the parser. The main function looks like:

@members {
  public static void main(String[] args) throws Exception {
    RecipeLexer lexer = new RecipeLexer(new ANTLRInputStream(System.in));
    for (Token t = lexer.nextToken(); t.getType() != EOF; t = lexer.nextToken())
      System.out.println(t.getType());
  }
}

The result is the same with the more obvious version:

rule: AAA | A A;
AAA: 'aaa';
A: 'a';

Evidently the ANTLR lexer tries to match the input 'aa' against the rule AAA, which fails. Regardless of ANTLR's parsers being LL(*), the lexer should work independently of the parser and should be able to resolve this ambiguity on its own. The grammar works fine with good old lex (or flex), but apparently not with ANTLR. So what is the problem here?

Thanks for the help!

K J
  • How are the tokens defined in your lexer? Looks to me that the lexer is preferring to match for `a` instead of `aaa` given a single `a` as input. – Dervall Aug 30 '12 at 06:08
  • @Dervall The token file looks like: `A=4 AAA=5` It prefers `aaa` to `a`. And it can parse `aaa` and `a` but not `aa`. – K J Aug 30 '12 at 06:40
  • @AustinHenley: Yes, it is greedy in the sense that it prefers longer tokens when there are multiple choices. But with the input 'aa', 'aaa' is not even a possible choice. – K J Aug 30 '12 at 06:42
  • Check out this incredibly detailed yet easy to follow page: https://wincent.com/wiki/ANTLR_lexers_in_depth. It helped me a lot to understand the ANTLR Lexer quirks. Especially the ".+ and .* default to non-greedy behaviour" is quite surprising! – TFuto Aug 09 '13 at 19:24

1 Answer

ANTLR's generated parsers are (or can be) LL(*), not its lexers.

When the lexer sees the input "aa", it tries to match token AAA. When it fails to do so, it tries to match any other token that also matches "aa" (the lexer does not backtrack to match A!). Since this is not possible, an error is produced.
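
The difference from lex can be illustrated with a rough model in plain Java. This is only a sketch under stated assumptions, not ANTLR's or lex's actual generated code: the class `LexerSketch` and the methods `committedLex` and `longestMatchLex` are hypothetical names, and the "committed" variant simply assumes that two characters of lookahead are used to choose AAA over A, after which a failed match is reported as an error instead of falling back.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerSketch {

    // Simplified model (an assumption, not ANTLR's real code): the rule is
    // chosen from two characters of lookahead, and once AAA is chosen there
    // is no fallback to A if the third 'a' is missing.
    static List<String> committedLex(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.charAt(pos) != 'a') {
                throw new IllegalStateException("unexpected character at " + pos);
            }
            if (input.startsWith("aa", pos)) {
                // Committed to AAA at this point.
                if (!input.startsWith("aaa", pos)) {
                    throw new IllegalStateException(
                        "mismatched character at " + (pos + 2) + " expecting 'a'");
                }
                tokens.add("AAA");
                pos += 3;
            } else {
                tokens.add("A");
                pos += 1;
            }
        }
        return tokens;
    }

    // lex/flex-style model: take the longest rule that actually matches
    // at the current position, so a partial AAA never blocks A.
    static List<String> longestMatchLex(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (input.startsWith("aaa", pos)) {
                tokens.add("AAA");
                pos += 3;
            } else if (input.startsWith("a", pos)) {
                tokens.add("A");
                pos += 1;
            } else {
                throw new IllegalStateException("unexpected character at " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(longestMatchLex("aa"));  // prints [A, A]
        System.out.println(committedLex("aaa"));    // prints [AAA]
        try {
            committedLex("aa");
        } catch (IllegalStateException e) {
            // Fails at position 2, like the error in the question.
            System.out.println(e.getMessage());
        }
    }
}
```

In this model both variants agree on "aaa" and "a"; only "aa" separates them, because the committed variant has already ruled out A by the time it discovers that AAA cannot be completed.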

This is usually not a problem, since in practice there's often some sort of identifier rule that "aa" can fall back to. So, what actual problem are you trying to solve, or were you only curious about the inner workings? If it's the former, please edit your question and describe your actual problem.

Bart Kiers
  • Thanks for the clarification, Bart. I guess it's closer to the second. I've been using lex/yacc and I'm trying to switch to ANTLR. The ANTLR parser already has its limitations as an LL parser, but as you pointed out, this is about the lexer, not the parser. To be honest, I'll be a little disappointed if the ANTLR lexer can't handle this level of complexity when other lexers like `lex` can. The backtracking cost wouldn't be huge: O(n^2) at worst, and it can be better if handled smartly. – K J Aug 30 '12 at 07:30
  • @KJ, there are ways to solve this, of course. But rather than explaining how to solve your "straw-man" example, I'd rather try to propose a solution to the "real" problem at hand (otherwise I end up answering twice...). – Bart Kiers Aug 30 '12 at 07:45
  • I'm afraid I'm not looking for a workaround for a specific problem. As I said, it's closer to curiosity: I was considering ANTLR because it supports Java, unlike yacc, but I'm getting cautious. I know there is a workaround for this problem with manual lookahead (I've seen your [previous post](http://stackoverflow.com/a/8800722/456933)), but having to deal with similar problems case by case doesn't seem reliable. Thanks for the answer, though! – K J Aug 30 '12 at 08:25