Lexer rule to handle escape of quote with quote or backslash in ANTLR4?

Question

I'm trying to extend the answer to "How do I escape an escape character with ANTLR 4?" so that it works when the " can be escaped with either " or \. That is,

"Rob ""Commander Taco"" Malda is smart."

and

"Rob \"Commander Taco\" Malda is smart."

are both valid and equivalent. I've tried

StringLiteral : '"' ('""'|'\\"'|~["])* '"';

but it fails to match

"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""

with the tokenizer consuming more characters than intended, i.e. consumes beyond \""

Does anyone know how to define the lexer rule?

A bit more detail...

"" succeeds
"""" succeeds
"\" " succeeds
"\"" succeeds (at EOF)
"\""\n"" fails (it greedily pulls in the \n and the next ")

Example: (test.txt)

""
""""
"\" "
"\""
""

grun test tokens -tokens < test.txt

line 5:1 token recognition error at: '"'
[@0,0:1='""',<StringLiteral>,1:0]
[@1,2:2='\n',<'
'>,1:2]
[@2,3:6='""""',<StringLiteral>,2:0]
[@3,7:7='\n',<'
'>,2:4]
[@4,8:12='"\" "',<StringLiteral>,3:0]
[@5,13:13='\n',<'
'>,3:5]
[@6,14:19='"\""\n"',<StringLiteral>,4:0]
[@7,21:20='<EOF>',<EOF>,5:2]

\"" and """ at the end of a StringLiteral are not being handled the same.

Here's the ATN for that rule:

From this diagram it's not clear why they should be handled differently. They appear to be parallel constructs.

More research

Test Grammar (small change to simplify ATN):

grammar test
    ;

start: StringLiteral (WS? StringLiteral)+;

StringLiteral: '"' ( (('\\' | '"') '"') | ~["])* '"';
WS:            [ \t\n\r]+;

The ATN for StringLiteral in this grammar:

OK, let's walk through this ATN with the input "\""\n"

unconsumed input	transition
"\""\n"	1 -ε-> 5
"\""\n"	5 -"-> 11
\""\n"	11 -ε-> 9
\""\n"	9 -ε-> 6
\""\n"	6 -\-> 7
""\n"	7 -"-> 10
"\n"	10 -ε-> 13
"\n"	13 -ε-> 11
"\n"	11 -ε-> 12
"\n"	12 -ε-> 14
"\n"	14 -"-> 15
\n"	15 -ε-> 2

We should reach State 2 with the " before the \n, which would be the desired behavior.

Instead, we see it continue on to consume the \n and the next ":

line 2:1 token recognition error at: '"'
[@0,0:5='"\""\n"',<StringLiteral>,1:0]
[@1,7:6='<EOF>',<EOF>,2:2]

In order for this to be valid, there must be a path from state 11 to state 2 that consumes a \n and a " (and I'm not seeing it).

Maybe I'm missing something, but it's looking more and more like a bug to me.

Hmmm... this one has me stumped. It took a bit to reproduce your problem, but now that I have, I can't see why this doesn't work. I've taken the liberty of adding more specific details from looking into it. Maybe that makes it easier for someone else to spot the problem (or, just maybe, confirm a bug?) — Mike Cargal, May 08 '22 at 22:36
@user2052153 don't you want to do either `'"' ('""'|'\\"'|~[\\"])* '"'` or `'"' ('""'|'\\"'|~["])*? '"'` instead? — Bart Kiers, May 09 '22 at 06:51
All work on my machine (including your original `StringLiteral`: check my answer). What version of ANTLR are you using? — Bart Kiers, May 09 '22 at 18:48
More convinced that this doesn't look right. I've added my reasoning to the question. — Mike Cargal, May 09 '22 at 19:19
Given `"\""\n"` gets matched in its entirety could be explained because `( (('\\' | '"') '"') | ~["])*` matches greedily. The inner characters, `\""\n`, are matched as follows `\\ ` (by `~["]`) and then `""` (by `('\\' | '"') '"'`) and finally `\n` (again by `~["]`). When you do `( (('\\' | '"') '"') | ~["])*?` it is different. — Bart Kiers, May 09 '22 at 20:33
Failed to add this as a post; it was refused because the formatting didn't match expectations, and I don't know how to make Stack Overflow accept it. Tested with 4.8 and 4.9.3 for the C++ target (4.10.1 won't build). (1) `'"' ('""'|'\\"'|~["])*? '"'` fails at double quotes, i.e. "diagnose "fuel cut-off"": the token is terminated at the first "". (2) `'"' ('""'|'\\"'|~["])* '"'` fails at \\": for example, with "Flag for \\"Chiller Water\\"" the tokenizer continues to consume after the last \". (3) `'"' ('""'|'\\"'|~[\\"])* '"'` fails when the string contains other backslashes, for example "Delay \\max ratio". — user2052153, May 10 '22 at 04:52

score 0 · Answer 1 · answered May 09 '22 at 18:46

I cannot reproduce it. Given the grammar:

grammar T;

parse
 : .*? EOF
 ;

StringLiteral
 : '"' ( '""' | '\\"' | ~["] )* '"'
 ;

Other
 : . -> skip
 ;

The following code:

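import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

public class Main {
    public static void main(String[] args) {
        // One string per line: doubled-quote escapes, backslash escapes, and the OP's failing input.
        String source =
            "\"Rob \"\"Commander Taco\"\" Malda is smart.\"\n" +
            "\"Rob \\\"Commander Taco\\\" Malda is smart.\"\n" +
            "\"Entry Flag for Offset check and for \\\"don't start Chiller Water Pump Request\\\"\"\n";

        // TLexer is the lexer class ANTLR generates from grammar T above.
        TLexer lexer = new TLexer(CharStreams.fromString(source));
        CommonTokenStream stream = new CommonTokenStream(lexer);

        // Tokenize the whole input up front.
        stream.fill();

        // Print each token's type and text, with newlines made visible.
        for (Token t : stream.getTokens()) {
            System.out.printf("%-20s '%s'\n",
                TLexer.VOCABULARY.getSymbolicName(t.getType()),
                t.getText().replace("\n", "\\n"));
        }
    }
}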

produces the following output:

StringLiteral        '"Rob ""Commander Taco"" Malda is smart."'
StringLiteral        '"Rob \"Commander Taco\" Malda is smart."'
StringLiteral        '"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""'

Tested with ANTLR 4.9.3 and 4.10.1: both produce the same output.

Try ending the second line with `smart.\\\"\"\n"` instead of `smart.\"\n"`. The `\""` ending works fine at EOF, but it broke for me when I tried it on a line with content following. (I'll try the change myself later, but have something to do right now.) — Mike Cargal, May 09 '22 at 19:40
That may well be (and I will try that tomorrow since it's getting late here), but the OP said `"Entry Flag for Offset check and for \"don't start Chiller Water Pump Request\""` failed to tokenize, which I cannot reproduce. — Bart Kiers, May 09 '22 at 20:15
OK, just tried it with the change I suggested, and the second `StringLiteral` is printed out as `'"Rob \"Commander Taco\" Malda is smart.\""\n"'`. (I still don't get what's going on to make that happen) — Mike Cargal, May 09 '22 at 20:50

Mike Cargal · Accepted Answer · 2022-05-10T20:13:50.070

The problem is handling the \ properly.

Bart found the path through the ATN that I missed, the one that allows it to match the extra \n". The \ is matched by the ~["] alternative, the "" that follows is then matched as an escaped quote, and the rule keeps consuming through the \n until the next " terminates the string.
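
To make that path concrete, here is a small sketch in plain Java (no ANTLR involved; the class and method names are made up for illustration). It brute-forces every way the rule '"' ('""' | '\\"' | ~["])* '"' can consume a prefix of the input and collects the prefix lengths at which the rule can finish. It mimics the rule by hand, on the assumption that the lexer keeps the longest accepting prefix, so read it as an illustration and not as a description of ANTLR's internals.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeSet;

final class StringLiteralPaths {
    // Collects every prefix length of s accepted by
    //   '"' ( '""' | '\\"' | ~["] )* '"'
    // by exploring all alternatives at every position.
    static TreeSet<Integer> acceptedLengths(String s) {
        TreeSet<Integer> accepted = new TreeSet<>();
        if (s.isEmpty() || s.charAt(0) != '"') {
            return accepted; // the rule must start with an opening quote
        }
        boolean[] seen = new boolean[s.length() + 1];
        Deque<Integer> work = new ArrayDeque<>();
        work.push(1);
        seen[1] = true;
        while (!work.isEmpty()) {
            int p = work.pop();
            if (p >= s.length()) {
                continue; // ran out of input without a closing quote
            }
            char c = s.charAt(p);
            boolean quoteNext = p + 1 < s.length() && s.charAt(p + 1) == '"';
            if (c == '"') {
                accepted.add(p + 1); // this quote may close the string
            }
            if ((c == '"' || c == '\\') && quoteNext && !seen[p + 2]) {
                seen[p + 2] = true;  // '""' or '\\"' consumed as an escaped quote
                work.push(p + 2);
            }
            if (c != '"' && !seen[p + 1]) {
                seen[p + 1] = true;  // ~["] consumes one character, including a backslash
                work.push(p + 1);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        String input = "\"\\\"\"\n\""; // the six characters "\""\n"
        TreeSet<Integer> lengths = acceptedLengths(input);
        System.out.println("accepting prefix lengths: " + lengths);
        System.out.println("longest: " + lengths.last() + " of " + input.length());
    }
}
```

For the input "\""\n" this reports accepting prefixes of length 3, 4 and 6. The longest one is the entire input, which is the token grun printed.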

We could disallow \ in the "everything but a "" alternative (~["\\]), but then a stand-alone \ still has to be accepted, so we'd want to add an alternative that allows a \ followed by anything other than a ". You'd think that '\\' ~["] does that, and you'd be right, to a point, but it also consumes the character following the \. That is a problem for a string like "test \\" string": once that alternative has consumed the second \, the \" alternative can no longer match. What you're looking for is a lookahead (i.e. consume the \ if it's not followed by a ", but don't consume the following character). But ANTLR lexer rules have no grammar syntax for lookahead (short of a target-specific semantic predicate).

You'll notice that most grammars that allow \" as an escape sequence in strings also require a bare \ to be escaped (\\), and frequently treat any other \ (other character) sequence as just the "other character".

If escaping the \ character is acceptable, the rule could be simplified to:

StringLiteral: '"' ('\\' . | '""' | ~["\\])* '"';

"Flag for \\"Chiller Water\\"" would not parse correctly, but "Flag for \\\"Chiller Water\\\"" would. Without lookahead, I'm not seeing a way to lex the first version.
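
As a rough check of that claim, here is a hand-written scanner in plain Java that follows the simplified rule '"' ('\\' . | '""' | ~["\\])* '"'. The class and method names are hypothetical, and the scanner is only my reading of the rule (keep the longest accepting prefix), not code generated by ANTLR. Because a \ always takes the next character with it, the only remaining choice is between a closing " and a doubled "", so a single left-to-right pass that remembers the last possible end is enough.

```java
final class EscapedBackslashScanner {
    // Length of the longest prefix of s matched by
    //   '"' ( '\\' . | '""' | ~["\\] )* '"'
    // or -1 if no prefix matches.
    static int matchLength(String s) {
        if (s.isEmpty() || s.charAt(0) != '"') {
            return -1; // the rule must start with an opening quote
        }
        int lastAccept = -1;
        int p = 1;
        while (p < s.length()) {
            char c = s.charAt(p);
            if (c == '\\') {
                if (p + 1 >= s.length()) {
                    break; // a backslash needs a character after it
                }
                p += 2;    // '\\' . consumes the backslash and whatever follows
            } else if (c == '"') {
                lastAccept = p + 1; // this quote could close the string
                if (p + 1 < s.length() && s.charAt(p + 1) == '"') {
                    p += 2; // '""' is a doubled quote, keep scanning
                } else {
                    break;  // a lone quote ends the string
                }
            } else {
                p += 1;     // ~["\\] consumes one ordinary character
            }
        }
        return lastAccept;
    }

    public static void main(String[] args) {
        String twoBackslashes = "\"Flag for \\\\\"Chiller Water\\\\\"\"";
        String threeBackslashes = "\"Flag for \\\\\\\"Chiller Water\\\\\\\"\"";
        System.out.println(matchLength(twoBackslashes) + " of " + twoBackslashes.length());
        System.out.println(matchLength(threeBackslashes) + " of " + threeBackslashes.length());
    }
}
```

With two backslashes the scanner stops right after "Flag for \\", so the token ends early. With three backslashes it consumes the whole input. On the problem input "\""\n" it stops after "\"" (four characters) and no longer swallows the newline.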

Also, note that if you don't escape the \, then you have an ambiguous interpretation of \"". Is it \" followed by a " to terminate the string, or \ followed by "" allowing the string to continue? ANTLR will take whichever interpretation consumes the most input, so we see it using the second interpretation and pulling in characters until it finds a ".

The syntax of the format I'm parsing has unfortunately been fixed for many years by a large international organization, so changing it isn't an option. Right now I'm thinking about preprocessing the data before feeding it to ANTLR. The preprocessing would replace all \" with "". Thoughts on that solution? — user2052153, May 10 '22 at 19:30
It may be necessary... see the note I just added. That thought probably did more for me to clarify the issue than all the other tortured "deep thinking" :) — Mike Cargal, May 10 '22 at 20:15
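
For reference, the preprocessing idea from the comment above could look roughly like this in Java. The class and method names are hypothetical. The sketch assumes that every \" in the input is meant as an escaped quote, so it picks the "\" followed by a closing "" reading of \"", and it rewrites \" everywhere, not just inside string literals. A string whose last character is a literal backslash would be damaged by it.

```java
final class EscapeNormalizer {
    // Rewrites every backslash-escaped quote (\") as a doubled quote ("")
    // so that a lexer rule only has to understand the "" escape.
    // Assumption: a backslash directly before a quote is always an escape.
    static String normalize(String raw) {
        return raw.replace("\\\"", "\"\"");
    }

    public static void main(String[] args) {
        String raw = "\"Rob \\\"Commander Taco\\\" Malda is smart.\"";
        System.out.println(raw);
        System.out.println(normalize(raw));
    }
}
```

After this step a rule such as '"' ('""' | ~["])* '"' should be enough for the lexer, since backslashes no longer carry any special meaning.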
