The lark parser predefines some common terminals, including a string. It is defined as follows:

_STRING_INNER: /.*?/
_STRING_ESC_INNER: _STRING_INNER /(?<!\\)(\\\\)*?/ 

ESCAPED_STRING : "\"" _STRING_ESC_INNER "\""

I do understand _STRING_INNER. I also understand how ESCAPED_STRING is composed. But what I don't really understand is _STRING_ESC_INNER.

If I read the regex correctly, all it says is that whenever there are two consecutive literal backslashes, they must not be preceded by another literal backslash?

How can I combine those two into a single regex?

And wouldn't it be required for the grammar to only allow escaped double quotes in the string data?

flowit

1 Answer

Preliminaries:

  • .*? is a non-greedy match, meaning the smallest possible number of repetitions of . (any symbol). This only makes sense when followed by something else. So .*?X on input AAXAAX would match only the first AAX, instead of expanding all the way to the last X.

  • (?<!...) is a "negative look-behind assertion": "Matches if the current position in the string is not preceded by a match for ...". So .*(?<!X)Y would match AY but not XY.
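
Both preliminaries can be checked with Python's re module, which uses the same regex syntax here. This is only a small sketch; the variable names are made up for illustration.

```python
import re

# Non-greedy: .*? stops at the first X it can reach.
lazy = re.match(r".*?X", "AAXAAX").group(0)

# Greedy, for contrast: .* runs on to the last X.
greedy = re.match(r".*X", "AAXAAX").group(0)

# Negative look-behind: Y is accepted only when it is not directly preceded by X.
lookbehind = re.compile(r".*(?<!X)Y")
matches_ay = lookbehind.fullmatch("AY") is not None
matches_xy = lookbehind.fullmatch("XY") is not None

print(lazy, greedy, matches_ay, matches_xy)
```

Here lazy is AAX while greedy is the whole input, and only AY passes the look-behind check.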

Applying this to your example:

  • ESCAPED_STRING: The rule says: match a ", then _STRING_ESC_INNER, and then a " again.

  • _STRING_INNER: Matches the smallest possible number of repetitions of any symbol. As said before, this only makes sense when considering the regular expression that comes after it.

  • _STRING_ESC_INNER: We want this to match the shortest possible string that does not contain a closing quote. That is, for an input "abc"xyz", we want to match "abc", instead of also consuming the xyz" part. However, we have to make sure that the " is really a closing quote, in that it should not be itself escaped. So for input "abc\"xyz", we do not want to match only "abc\", because the \" is escaped. We observe that the closing " has to be directly preceded by an even number of \ (with zero being an even number). So " is ok, \\" is ok, \\\\" is ok etc. But as soon as " is preceded by an odd number of \, that means the " is not really a closing quote.

    (\\\\) matches \\. The (?<!\\) says "the position before should not have \". So combined (?<!\\)(\\\\) means "match \\, but only if it is not preceded by \".

    The following *? then takes the smallest possible number of repetitions of this, which again only makes sense together with the regular expression that comes after it: the " from the ESCAPED_STRING rule. (Possible point of confusion: the \" in the ESCAPED_STRING rule refers to a literal " in the actual input we want to match, in the same way that \\\\ refers to \\ in the input.) So (?<!\\)(\\\\)*?\" means "match the smallest number of \\ pairs that is followed by " and not preceded by \". In other words, (?<!\\)(\\\\)*?\" matches only a " that is preceded by an even number of \ (including zero).

    Now combining it with the preceding _STRING_INNER, the _STRING_ESC_INNER rule together with the closing quote says: match up to the first " preceded by an even number of \, so in other words, the first " that is not itself escaped.
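
As for combining the pieces into a single regex: one way to try it out is to concatenate them by hand in Python. This is a sketch under the assumption that joining the terminal's parts in order gives an equivalent pattern; first_string and is_closing_quote are hypothetical helpers written only for this illustration, not part of Lark.

```python
import re

# Assumed single-regex form of: "\"" _STRING_INNER /(?<!\\)(\\\\)*?/ "\""
ESCAPED_STRING = re.compile(r'".*?(?<!\\)(\\\\)*?"')


def first_string(text):
    """Return the first quoted string found in text, or None."""
    m = ESCAPED_STRING.search(text)
    return m.group(0) if m else None


def is_closing_quote(text, i):
    """Hypothetical check: is the quote at index i preceded by an even number of backslashes?"""
    n = 0
    while i - 1 - n >= 0 and text[i - 1 - n] == "\\":
        n += 1
    return n % 2 == 0


# Raw strings, so every backslash below is a literal backslash in the input.
samples = [
    r'"abc"xyz"',    # no backslash: the first inner quote closes the string
    r'"abc\"xyz"',   # one backslash: the quote is escaped, keep going
    r'"abc\\"xyz"',  # two backslashes: an escaped backslash, the quote closes
]
for s in samples:
    print(s, "->", first_string(s))
```

The first and third samples stop at the quote right after abc (with its two backslashes in the third case), while the second sample is matched all the way to the final quote, which is the even/odd rule described above.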

  • Thanks. But why do we want to match an escaped quote `\"`? That basically means the string is not yet complete and there are more characters to be consumed. – flowit Apr 22 '20 at 20:21
  • Got it. When you write `\"` you actually mean the literal `"`, just in regex style so it has to be escaped. – flowit Apr 22 '20 at 20:27
  • Yes, I'm sorry, that was poorly worded on my part. I edited the answer to make it more clear. So the `\"` in the program code corresponds to a `"` in the input. – f9c69e9781fa194211448473495534 Apr 22 '20 at 20:31
  • I'm also confused about `\`. why does `_STRING_INNER` have two backslashes? in `/.*?/`? – Charlie Parker Jul 07 '22 at 16:14
  • @CharlieParker I'm not fully sure what you mean? If your question is why there are two forward slashes in `_STRING_INNER: /.*?/`, this is just the syntax for specifying a regular expression in Lark (https://lark-parser.readthedocs.io/en/latest/grammar.html#terminals). I guess this was chosen to make it easier to differentiate between regular strings (`"somestring"`) and regular expressions (`/someregex/`). – f9c69e9781fa194211448473495534 Jul 07 '22 at 17:31
  • In general, you can refer to https://stackoverflow.com/questions/15661969/what-does-the-forward-slash-mean-within-a-javascript-regular-expression for some more background on this convention. – f9c69e9781fa194211448473495534 Jul 07 '22 at 17:37