I'm trying to extend the answer to "How do I escape an escape character with ANTLR 4?" so that a " can be escaped with either another " or a \. That is,
"Rob ""Commander Taco"" Malda is smart."
and
"Rob \"Commander Taco\" Malda is smart."
are both valid and equivalent. I've tried
StringLiteral : '"' ('""'|'\\"'|~["])* '"';
but it fails to match
"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""
because the tokenizer consumes more characters than intended, i.e. it reads past the closing \"" and into the following input.
Does anyone know how to define this lexer rule?
A bit more detail:
- `""` succeeds
- `""""` succeeds
- `"\" "` succeeds
- `"\""` succeeds (at EOF)
- `"\""\n""` fails (it greedily pulls in the `\n` and the following `"`)
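For anyone who wants to reproduce these cases without running ANTLR, here is a minimal Python sketch that emulates the rule by hand. It is not ANTLR's ATN simulator: the helper names `match_ends` and `longest_token` are made up for this post, and it assumes only that the lexer keeps the longest possible match and that `~["]` accepts any character other than `"`, including a backslash.

```python
# Hand-written emulation of the lexer rule
#   StringLiteral : '"' ('""'|'\\"'|~["])* '"';
# It is NOT ANTLR's ATN simulator; it only tracks which offsets are reachable.

def match_ends(text, start=0):
    # Return every offset at which a StringLiteral starting at `start` could end.
    if not text.startswith('"', start):
        return set()
    ends, seen = set(), set()
    frontier = [start + 1]              # just after the opening quote
    while frontier:
        pos = frontier.pop()
        if pos in seen or pos >= len(text):
            continue
        seen.add(pos)
        if text[pos] == '"':
            ends.add(pos + 1)           # read this quote as the closing quote
            if text.startswith('""', pos):
                frontier.append(pos + 2)    # alternative '""'
        else:
            if text.startswith('\\"', pos):
                frontier.append(pos + 2)    # alternative '\\"'
            frontier.append(pos + 1)        # alternative ~["], also accepts a backslash
    return ends


def longest_token(text, start=0):
    # Longest-match rule: of all possible ends, keep the furthest one.
    ends = match_ends(text, start)
    return text[start:max(ends)] if ends else None


if __name__ == "__main__":
    cases = ['""', '""""', '"\\" "', '"\\""', '"\\""\n""']
    for case in cases:
        print(repr(case), "->", repr(longest_token(case)))
```

In this model the first four cases come back unchanged, while the last one yields the six-character token `"\""\n"`, which is the same span that grun reports.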
Example (test.txt):
""
""""
"\" "
"\""
""
grun test tokens -tokens < test.txt
line 5:1 token recognition error at: '"'
[@0,0:1='""',<StringLiteral>,1:0]
[@1,2:2='\n',<'
'>,1:2]
[@2,3:6='""""',<StringLiteral>,2:0]
[@3,7:7='\n',<'
'>,2:4]
[@4,8:12='"\" "',<StringLiteral>,3:0]
[@5,13:13='\n',<'
'>,3:5]
[@6,14:19='"\""\n"',<StringLiteral>,4:0]
[@7,21:20='<EOF>',<EOF>,5:2]
`\""` and `"""` at the end of a StringLiteral are not being handled the same.
Here's the ATN for that rule:
From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.
More research
Test Grammar (small change to simplify ATN):
grammar test;
start: StringLiteral (WS? StringLiteral)+;
StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS: [ \t\n\r]+;
The ATN for StringLiteral in this grammar:
OK, let's walk through this ATN with the input "\""\n"
unconsumed input | transition |
---|---|
"\""\n" | 1 -ε-> 5 |
"\""\n" | 5 -"-> 11 |
\""\n" | 11 -ε-> 9 |
\""\n" | 9 -ε-> 6 |
\""\n" | 6 -\-> 7 |
""\n" | 7 -"-> 10 |
"\n" | 10 -ε-> 13 |
"\n" | 13 -ε-> 11 |
"\n" | 11 -ε-> 12 |
"\n" | 12 -ε-> 14 |
"\n" | 14 -"-> 15 |
\n" | 15 -ε-> 2 |
We should reach State 2 with the `"` before the `\n`, which would be the desired behavior.
Instead, we see it continue on to consume the `\n` and the next `"`:
line 2:1 token recognition error at: '"'
[@0,0:5='"\""\n"',<StringLiteral>,1:0]
[@1,7:6='<EOF>',<EOF>,2:2]
In order for this to be valid, there must be a path from state 11 to state 2 that consumes a `\n` and a `"` (and I'm not seeing it).
Maybe I'm missing something, but it's looking more and more like a bug to me.
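One way to look for the missing path is to enumerate every route through the loop rather than following a single one. The Python sketch below does that for the test grammar's rule `'"' ( (('\\' | '"') '"') | ~["])* '"'`. It is a hand-written model, not ANTLR code: `paths` and `tokenizations` are hypothetical helpers, the step labels (`open`, `pair`, `set`, `close`) are my own, and the key assumption is that `~["]` matches a backslash as well as a newline.

```python
# Enumerate every way the test rule
#   StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
# can match a prefix of the input. Exponential in the worst case, so only
# meant for short inputs like the one in the walk-through.

def paths(text, pos, steps):
    # Yield (end_offset, steps) for each way the loop can finish from `pos`.
    if pos >= len(text):
        return
    ch = text[pos]
    if ch == '"':
        yield pos + 1, steps + [("close", ch)]              # closing quote
    if ch in '\\"' and text.startswith('"', pos + 1):
        pair = text[pos:pos + 2]
        yield from paths(text, pos + 2, steps + [("pair", pair)])   # \" or ""
    if ch != '"':
        yield from paths(text, pos + 1, steps + [("set", ch)])      # ~["]


def tokenizations(text):
    # All candidate matches that begin with the opening quote at offset 0.
    if not text.startswith('"'):
        return []
    return list(paths(text, 1, [("open", '"')]))


if __name__ == "__main__":
    for end, steps in sorted(tokenizations('"\\""\n"'), key=lambda r: r[0]):
        print(end, steps)
```

Under these assumptions the input `"\""\n"` has three candidate ends (offsets 3, 4 and 6). The longest one reads the backslash through `~["]`, then takes `""` as an escaped pair, then the `\n` through `~["]`, then the final `"` as the closing quote. If that model is right, the longer token is a legal match rather than a bug, and excluding the backslash from the set (for example `~["\\]`, untested here) would remove that path, though backslashes that do not precede a quote would then need their own alternative.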