ANTLR4 grammar for SML choking on positive integer literals

Question

I'm building a parser for SML using ANTLR 4.8, and for some reason the generated parser keeps choking on integer literals:

# CLASSPATH=bin ./scripts/grun SML expression -tree <<<'1'
line 1:0 mismatched input '1' expecting {'(', 'let', 'op', '{', '()', '[', '#', 'raise', 'if', 'while', 'case', 'fn', LONGID, CONSTANT}
(expression 1)

I've trimmed the grammar as far as I can while still reproducing the issue, which seems very strange. This grammar reproduces it (even though LABEL isn't used by any parser rule):

grammar SML_Small;

Whitespace : [ \t\r\n]+ -> skip ;

expression : CONSTANT ;

LABEL : [1-9] NUM* ;

CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;

On the other hand, removing LABEL makes positive numbers work again:

grammar SML_Small;

Whitespace : [ \t\r\n]+ -> skip ;

expression : CONSTANT ;

CONSTANT : INT ;
INT : '~'? NUM ;
NUM : DIGIT+ ;
DIGIT : [0-9] ;

I've tried replacing NUM* with DIGIT? and similar variations, but that didn't fix my problem.

I'm really not sure what's going on, so I suspect it's something deeper than the syntax I'm using.

`1` is a `LABEL`, not a `CONSTANT`, because lexemes can only be one thing and the first longest match is the one which wins. — rici, Aug 25 '20 at 00:27
@rici doh! That makes perfect sense, but I wasn’t thinking about it. — D. Ben Knoble, Aug 25 '20 at 01:01

score 0 · Answer 1 · answered Aug 25 '20 at 07:27

As already mentioned in the comments by rici: the lexer tries to match as many characters as possible, and when two or more rules match the same characters, the one defined first "wins". So with rules like these:

LABEL    : [1-9] NUM* ;
CONSTANT : INT ;
INT      : '~'? NUM ;
NUM      : DIGIT+ ;
DIGIT    : [0-9] ;

the input 1 will always become a LABEL, and input like 0 will always be a CONSTANT. An INT token will never be created: CONSTANT matches exactly the same text and is defined before it, so input like ~5 also becomes a CONSTANT. NUM and DIGIT will never produce a token either, since the rules before them match first. The fact that NUM and DIGIT can never become tokens on their own makes them candidates for becoming fragment rules:

fragment NUM   : DIGIT+ ;
fragment DIGIT : [0-9] ;

That way, you can't accidentally use these tokens inside parser rules.
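
The matching behaviour described above can be checked outside ANTLR with a small simulation. The sketch below is plain Python, not ANTLR's actual lexer: the rule table is a hand translation of the question's lexer rules into regular expressions, and the names RULES and first_token are made up for this illustration. It keeps the longest match and breaks ties in favour of the rule listed first.

```python
import re

# Hand-translated lexer rules from the question, in definition order.
# This is an illustrative model, not ANTLR's implementation.
RULES = [
    ("LABEL",    r"[1-9][0-9]*"),
    ("CONSTANT", r"~?[0-9]+"),
    ("INT",      r"~?[0-9]+"),
    ("NUM",      r"[0-9]+"),
    ("DIGIT",    r"[0-9]"),
]

def first_token(text, rules=RULES):
    """Return (rule_name, lexeme) for the token at the start of text, or None."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Only a strictly longer match replaces the current best,
        # so on a tie the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

for source in ["1", "0", "~5", "42", "007"]:
    print(source, "->", first_token(source))
```

With the LABEL entry removed (first_token("1", RULES[1:])), the same input comes back as CONSTANT, which mirrors the question's second grammar.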

Also, making ~ part of a token is usually not the way to go. You'll probably also want ~(1 + 2) to be a valid expression. So a unary operator like ~ is often better handled in a parser rule: expression : '~' expression | ... ;.

Finally, if you want to distinguish integers that start with a non-zero digit, so that they can also serve as labels, you can do it like this:

grammar SML_Small;


expression
 : '(' expression ')'
 | '~' expression
 | integer 
 ;

integer
 : INT
 | INT_NON_ZERO
 ;

label
 : INT_NON_ZERO
 ;

INT_NON_ZERO : [1-9] DIGIT* ;
INT          : DIGIT+ ;
SPACES       : [ \t\r\n]+ -> skip ;

fragment DIGIT : [0-9] ;
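
To see how the two integer tokens divide the input, the same longest-match idea can be applied to the revised lexer rules. Again, this is a hand-written Python model with hypothetical names (LEXER_RULES, classify), not generated ANTLR code, and it assumes the whole input is meant to be a single digit string.

```python
import re

# Revised lexer rules in definition order (illustrative model only).
LEXER_RULES = [
    ("INT_NON_ZERO", r"[1-9][0-9]*"),
    ("INT",          r"[0-9]+"),
]

# Token types accepted by the parser rules `integer` and `label`.
INTEGER_TOKENS = {"INT", "INT_NON_ZERO"}
LABEL_TOKENS = {"INT_NON_ZERO"}

def classify(text):
    """Return the token type if text is exactly one integer token, else None."""
    best_name, best_len = None, 0
    for name, pattern in LEXER_RULES:
        m = re.match(pattern, text)
        # Longest match wins; on a tie the earlier rule is kept.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    # Reject input that the winning rule does not consume completely.
    return best_name if best_len == len(text) else None

for source in ["7", "10", "0", "007"]:
    kind = classify(source)
    print(source, kind, "integer:", kind in INTEGER_TOKENS, "label:", kind in LABEL_TOKENS)
```

Because the integer parser rule accepts both token types, every digit string is still an integer, while only those without a leading zero qualify as labels.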

Actually, digit does show up elsewhere in the grammar, and num may as well. I’ll have to investigate the most appropriate fix for the more complex grammar, but this was very insightful. And `~` is a valid identifier in SML (an alias for `Int.~`, I believe), so the rules of function application in expressions permit `~ (1+2)`—but it is *also* a part of the language to express negative constants. — D. Ben Knoble, Aug 25 '20 at 11:46
Well, this gave me a large start and a rule to work from, but I have a long way to go on the complete grammar (lots of token overlap, currently). I'll mark this as accepted, though, since you solved the issue and enabled me to make progress. — D. Ben Knoble, Aug 25 '20 at 20:43
