I came up with the following regex to extract quoted strings from text:

"(?P<quote>.+?(?<![^\\]\\))"

It seems to work OK on my test input: https://regex101.com/r/NVjtW4/1.

However, I was wondering if there were any other "techniques" you could use to extract quoted texts. Perhaps with the following constraints:

  • Not using .+?
  • Without using a negative lookbehind (perhaps a negated character class instead).

Basically my question here is not, "What is the one way to do it?", but "What might be other alternatives" so I can see different possible approaches to solve what to me feels like a difficult and tricky regex to craft (escape one \ but not two \\, etc.)

Additionally, I want to check whether the closing quote is preceded by an even number of backslashes (an odd number would mean the quote itself is escaped):

".*?(?<=(\\{2})*)"

But this gives me an error of "* A quantifier inside a lookbehind makes it non-fixed width". Another one I had is:

"[^((\\{2})*")]+"

But this also doesn't match escaped quotes.

samuelbrody1249
  • There is also the common [unrolled](http://www.softec.lu/site/RegularExpressions/UnrollingTheLoop) regex, as in [this answer: 3rd variant](https://stackoverflow.com/a/5696141/5527985). Probably the most efficient pattern. – bobble bubble Nov 06 '19 at 22:14
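For readers who have not seen the unrolled-loop idea from the comment, here is a minimal Python sketch. The pattern is a commonly cited form of the technique; whether it is character-for-character the variant in the linked answer is not checked here, and the names UNROLLED and unrolled_quotes are purely illustrative.

```python
import re

# Unrolled loop: a run of "normal" characters, then zero or more repetitions of
# (one escaped character followed by another run of normal characters).
# No lazy quantifier and no lookbehind are involved.
UNROLLED = re.compile(r'"(?P<quote>[^"\\]*(?:\\.[^"\\]*)*)"')

def unrolled_quotes(text):
    """Return the contents of each double-quoted string found in text."""
    return [m.group("quote") for m in UNROLLED.finditer(text)]

print(unrolled_quotes(r'x "a \" b" y "c\\" z'))
```

Each alternative inside the loop starts with a different character, so the engine never has to choose between two ways of consuming the same text, which is why this form is usually considered efficient.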

1 Answer

This can be accomplished without using lazy quantifiers and lookbehinds:

"(?<quote>(?:[^"\\]|(?:\\["\\])*)*)"

This works as follows:

  • (?:[^"\\]|(?:\\["\\])*)* - Match either of the following options, any number of times
    • [^"\\] - Option 1: Match any character except \ or "
    • (?:\\["\\])* - Option 2: Match \ followed by \ or ", any number of times
      • This matches the following cases \\, \\\\, \\\\\\, etc., and \", \\\", \\\\\", etc.
      • If you want it to also match cases like \a, change \\["\\] to \\. (a backslash followed by any character)
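As a quick way to try the pattern above, here is a minimal Python sketch. It assumes Python's built-in re module, which spells the named group (?P<quote>...) rather than (?<quote>...); the names QUOTED, extract_quotes and sample are illustrative only.

```python
import re

# The pattern from this answer, with the named group rewritten as (?P<quote>...)
# because Python's re module does not accept the (?<quote>...) spelling.
QUOTED = re.compile(r'"(?P<quote>(?:[^"\\]|(?:\\["\\])*)*)"')

def extract_quotes(text):
    """Return the contents of each double-quoted string found in text."""
    return [m.group("quote") for m in QUOTED.finditer(text)]

# Raw string, so every backslash below is a literal backslash in the input.
sample = r'He said "hi \"there\"" and then "a\\" loudly'
print(extract_quotes(sample))
```

On this sample the two extracted strings are hi \"there\" and a\\: the escaped quotes stay inside the first match, and the doubled backslash does not escape the closing quote of the second.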

The error you get with the lookbehind ("* A quantifier inside a lookbehind makes it non-fixed width") occurs because the regex engine you're using doesn't allow variable-width lookbehinds.

Some regex engines do allow this (e.g. .NET's), but most do not (e.g. PCRE). To work around it, some engines provide the \K token, which resets the start of the reported match, for example: (?:\\{2})*\K
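Python's built-in re module is another engine with the fixed-width restriction, which makes it easy to reproduce the problem. The sketch below only checks whether a pattern compiles; the helper name compiles and the fixed-width contrast pattern are illustrative and are not meant as a solution to the original problem.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The lookbehind from the question contains a * quantifier, so its width varies.
variable_width = r'".*?(?<=(\\{2})*)"'
# For contrast: a lookbehind that is always exactly two backslashes wide.
fixed_width = r'".*?(?<=\\{2})"'

print(compiles(variable_width), compiles(fixed_width))  # prints: False True
```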

ctwheels