ANTLR4: Matching an identifier but NOT a keyword

Question

I'm using ANTLR4 to lex and parse a string. The string is this:

alpha at 3

The grammar is as such:

access: IDENTIFIER 'at' INT;
IDENTIFIER: [A-Za-z]+;
INT: '-'? ([1-9][0-9]* | [0-9]);

However, ANTLR gives me `line 1:6 mismatched input 'at' expecting 'at'`. I've found that this is because IDENTIFIER is a superset of 'at', as seen in this answer. So I tried changing the grammar to this:

access: identifier AT INT;
identifier: NAME | ~AT;
NAME: [A-Za-z]+;
INT: '-'? ([1-9][0-9]* | [0-9]);
AT: 'at';

However, I get an identical error.

How can I match alpha at 3 where alpha is [A-Za-z]+ while at is also in [A-Za-z]+?

Your first version of the grammar does not give me the error (but your second version does). — Sweeper, Nov 22 '21 at 17:03
If you move `AT` to before `NAME` in the second version of the grammar, I think that should work too. — Sweeper, Nov 22 '21 at 17:06

score 1 · Answer 1 · answered Nov 22 '21 at 18:00

In my work with ANTLR4 I found it easier to divide my grammar into a separate lexer and parser. This has its own learning curve, but the result is that I think in terms of tokens being fed to the parser, and I can use grun -tokens to check that my tokens are being recognized by the lexer before they get to the parser. I'm still an ANTLR4 newbie, so I'm maybe 2 weeks ahead of you on the learning curve after playing with ANTLR4 off and on for a few years.

So in my grammar files I would have myLexer.g4 (which must begin with a `lexer grammar myLexer;` header):

AT: 'at';
IDENTIFIER: [a-zA-Z]+;
INT: '-'? [0-9]+;

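myParser.g4 (which must begin with a `parser grammar myParser;` header followed by `options { tokenVocab = myLexer; }` so the parser can see the lexer's token types):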

 access: IDENTIFIER AT INT;

Beware: after you run

 antlr4 myLexer.g4
 antlr4 myParser.g4
 javac *.java

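The grun command to run your parser is not the following, because grun itself appends "Lexer" and "Parser" to the grammar name you give it (and it expects the grammar name and start rule first, then options such as -tokens):

 grun myParser access -tokens infile

but

 grun my access -tokens infile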

Adding "Parser" to the name always trips me up when I split my grammar into separate lexer/parser g4 files. I typically use ANTLR4, get mediocre at it, then don't use it for 8-12 months and run into the same issues, at which point I come back here to Stack Overflow to get myself on track.

With this lexer, the text "at" shows up in the grun -tokens output specifically as an AT token. But as mentioned in the comments, the AT rule needs to come before the IDENTIFIER rule.

In any case where two rules can match the same text (here "at" matches AT: 'at' and is also a legal IDENTIFIER: [a-zA-Z]+), put the more specific rule first. Also, I tend to avoid the greedy * matches and use the non-greedy *? matches, even though I don't quite have my head around the specific mechanics of how ANTLR4 distinguishes between '*' and '*?'. Future study for this student.

The other trick you can use is lexer modes. I think the maintenance overhead and complexity of lexer modes is a bit high, but they can provide a workaround to solve a problem until you can get your head around a "proper" parsing solution. That's how I use them today: as a crutch to get my problem solved, with "//TODO - I need to fix this" comments in my grammar. So if your parsing gets more complex, you could try lexer modes, but I think they are a risky crutch, and you can get far down a time-sink rabbit hole with them. (I think I'm halfway down one now.)

But I find ANTLR4 is a wonderful parsing tool, although I sometimes think I might have been better off just hardcoding C/Perl parsers than learning ANTLR4. The end result, I'm finding, is a grammar that can be more powerful than falling back on my old C/Perl brute-force token readers. And it's orders of magnitude more productive than trying Lex/Yacc was in the old days; I never got far enough down that path to consider them useful tools. ANTLR4 has been way more useful.

score 0 · Answer 2 · answered Nov 22 '21 at 19:10

The first grammar you mentioned works fine.

The second:

access: identifier AT INT;
identifier: NAME | ~AT;
NAME: [A-Za-z]+;
INT: '-'? ([1-9][0-9]* | [0-9]);
AT: 'at';

does indeed produce the error. This is because NAME and AT both match the text "at", and because NAME is defined before AT, a NAME token is created. The parser rule access then sees a NAME where it expects an AT.

Be careful with such overlapping tokens: always place keyword rules above NAME or identifier token rules:

access: IDENTIFIER AT INT;
AT: 'at';
IDENTIFIER: [A-Za-z]+;
INT: '-'? ([1-9][0-9]* | [0-9]);

Note that ANTLR's lexer prefers the longest match, and only looks at which rule is defined first when several rules match the same number of characters. So for input like "atat", a single IDENTIFIER token will be created (not 2 AT tokens!).
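
To make the "longest match first, rule order as tie-breaker" behaviour concrete, here is a toy model of it in plain Python. This is only a sketch and not how ANTLR is implemented: the function `tokenize` and the rule lists `KEYWORD_FIRST` and `NAME_FIRST` are made up for illustration, the token patterns are rewritten as Python regular expressions, and whitespace is simply skipped (an assumption, since the grammars in this thread define no whitespace rule).

```python
import re


def tokenize(text, rules):
    """Toy lexer: the longest match wins; ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # assumption: whitespace is skipped, not tokenized
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: a later rule only wins with a LONGER match,
            # so on a tie the earlier rule keeps the token.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


INT_PATTERN = r"-?([1-9][0-9]*|[0-9])"

# Keyword rule listed before the identifier rule (the working grammar).
KEYWORD_FIRST = [("AT", r"at"), ("IDENTIFIER", r"[A-Za-z]+"), ("INT", INT_PATTERN)]

# Identifier rule listed before the keyword rule (the failing grammar).
NAME_FIRST = [("NAME", r"[A-Za-z]+"), ("INT", INT_PATTERN), ("AT", r"at")]

print(tokenize("alpha at 3", KEYWORD_FIRST))
print(tokenize("alpha at 3", NAME_FIRST))
print(tokenize("atat", KEYWORD_FIRST))
```

With `KEYWORD_FIRST`, the text "at" becomes an AT token because AT and IDENTIFIER tie at two characters and AT is listed first. With `NAME_FIRST`, the same text becomes a NAME token, which is the situation that makes the parser complain. For "atat", IDENTIFIER matches four characters against AT's two, so rule order never comes into play.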
