I am having trouble distinguishing a keyword from a non-keyword when a grammar allows the non-keyword to have a similar "look" to the keyword.
Here's the grammar:
grammar Query;
options {
output = AST;
backtrack = true;
}
tokens {
DefaultBooleanNode;
}
// Parser
startExpression : expression EOF ;
expression : withinExpression ;
withinExpression
: defaultBooleanExpression
(WSLASH^ NUMBER defaultBooleanExpression)*
defaultBooleanExpression
: (queryFragment -> queryFragment)
(e=queryFragment -> ^(DefaultBooleanNode $defaultBooleanExpression $e))*
;
queryFragment : unquotedQuery ;
unquotedQuery : UNQUOTED | NUMBER ;
// Lexer
WSLASH : ('W'|'w') '/';
NUMBER : Digit+ ('.' Digit+)? ;
UNQUOTED : UnquotedStartChar UnquotedChar* ;
fragment UnquotedStartChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '/' | '(' | ')' | '[' | ']'
| '{' | '}' | '-' | '+' | '~' | '&' | '|'
| '!' | '^' | '?' | '*' )
;
fragment UnquotedChar
: EscapeSequence
| ~( ' ' | '\r' | '\t' | '\u000C' | '\n' | '\\'
| ':' | '"' | '(' | ')' | '[' | ']' | '{'
| '}' | '~' | '&' | '|' | '!' | '^' | '?'
| '*' )
;
fragment EscapeSequence
: '\\'
( 'u' HexDigit HexDigit HexDigit HexDigit
| ~( 'u' )
)
;
fragment Digit : ('0'..'9') ;
fragment HexDigit : ('0'..'9' | 'a'..'f' | 'A'..'F') ;
WHITESPACE : ( ' ' | '\r' | '\t' | '\u000C' | '\n' ) { skip(); };
I have simplified it enough to get rid of the distractions but I think removing any more would remove the problem.
- A slash is permitted in the middle of an unquoted query fragment.
- Boolean queries in particular have no required keyword.
- A new syntax (e.g. W/3) is being introduced but I'm trying not to affect existing queries which happen to look similar (e.g. X/Y)
- Due to '/' being valid as part of a word, ANTLR appears to be giving me "W/3" as a single token of type UNQUOTED instead of it being a WSLASH followed by a NUMBER.
- Due to the above, I end up with a tree like: DefaultBooleanNode(DefaultBooleanNode(~first clause~, "W/3"), ~second clause~), whereas what I really wanted was WSLASH(~first clause~, "3", ~second clause~).
What I would like to do is somehow write the UNQUOTED rule as "what I have now, but not matching ~~~~", but I'm at a loss for how to do that.
I realise that I could spell it out in full, e.g.:
- Any character from UnquotedStartChar except 'w', followed by the rest of the rule
- 'w' followed by any character from UnquotedChar except '/', followed by the rest of the rule
- 'w/' followed by any character from UnquotedChar except digits
- ...
However, that would look awful. :)