I'm a regex noob attempting to match either the contents or the entirety of a quoted segment of text without breaking on escaped quotation marks.

Put another way, I need a regex that, between two quotation marks, will match all characters that are not quotation marks and also any quotation mark that has an odd number of consecutive backslashes preceding it. It has to be an odd number of backslashes because a pair of backslashes escapes to a single backslash.

I've written a regex that does this, but it relies on look-behind. This project is in C++, and the regex implementation in the C++ standard library does not support look-behind, so I can't use that regex.

Here is the regex with look-behind that I came up with: "(((?<!\\)(\\\\)*\\"|[^"])*)"

The following text should produce 8 matches:

"Woah. Look. A  tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed."

And here's a link to this example: https://regex101.com/r/uOxqWl/1

If this is impossible to do without look-behind, please let me know. Also, if there is a well-regarded C++ regex library that allows regex look-behind, please let me know (It doesn't have to be ECMAScript, though I would slightly prefer that).

  • C++ regex in standard library does not support lookbehind: https://stackoverflow.com/questions/14538687/using-regex-lookbehinds-in-c11. There are some libraries like Boost.Regex that do it. – jignatius Jan 23 '20 at 06:21
  • Maybe it could help to look at "Compile time regex": github.com/hanickadot/compile-time-regular-expressions – A M Jan 23 '20 at 08:54
  • I am a bit puzzled. Why did you ever need lookbehind? An English description of a C-style string translates straightforwardly to a standard garden variety regex without any special features. – n. m. could be an AI Jan 23 '20 at 10:18
  • @n.'pronouns'm. Cool, dude. I'm glad it's very simple for you. Care to share a regex that works? Also, if you must know, I'm creating a small compiler and I want to be able to read something like "let x = "This is a \"string\" with quotes."" without tokenizing string elements. – Nicholas Bonjour Jan 23 '20 at 12:26
  • @jignatius I am painfully aware that C++ does not support look-behind (as I said several times in my post), but thank you for the library suggestion. – Nicholas Bonjour Jan 23 '20 at 12:30
  • @ArminMontigny that's an interesting library for sure. I'll check it out. Thanks! – Nicholas Bonjour Jan 23 '20 at 12:34

1 Answer

Let's derive a garden variety regular expression for C-style strings from an English description.

A string is a quotation mark, followed by a sequence of string-characters, followed by another quotation mark.

std::regex stringMatcher ( R"("<string-character>*")" );

This doesn't work yet, because <string-character> is only a placeholder that we haven't defined. We can define it piece by piece.

Firstly, a string character could be any character except a quotation mark and a backslash.

 R"([^\\"])"

Secondly, a string character could be an escape sequence consisting of a backslash and a single other character from a fixed set.

 R"(\\[abfnrtv'"\\?])"

Thirdly, it can be an octal escape sequence that consists of a backslash and three octal digits.

 R"(\\[0-7][0-7][0-7])"

(We simplify here a bit because the real C standard allows 1, 2 or 3 octal digits. This is easy to add.)
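One way to add the shorter forms is a bounded repetition. This is a minimal sketch, assuming the ECMAScript grammar that std::regex uses by default; the helper name isOctalEscape is made up for illustration and is not part of the answer.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when `s` is exactly one octal escape
// sequence, i.e. a backslash followed by one, two or three octal digits.
bool isOctalEscape(const std::string& s) {
    // {1,3} replaces the three fixed [0-7] groups used in the answer.
    static const std::regex octalEscape(R"(\\[0-7]{1,3})");
    return std::regex_match(s, octalEscape);
}
```

Inside the full pattern, the alternative [0-7][0-7][0-7] would become [0-7]{1,3}. A fourth digit is then simply matched as an ordinary string character.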

Fourthly, it can be a hexadecimal escape sequence that consists of a backslash, a letter x, and a hexadecimal number. The range of the number is implementation-defined, so we need to accept any number of hexadecimal digits (at least one).

 R"(\\x[0-9a-fA-F][0-9a-fA-F]*)"

We omit universal character names; they could be added in exactly the same way. There are none in the given test example.
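For illustration, a universal character name could be matched as sketched below. The sketch assumes the two forms \uXXXX (four hexadecimal digits) and \UXXXXXXXX (eight hexadecimal digits); the helper name isUniversalCharacterName is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when `s` is exactly one universal
// character name, either \u plus 4 hex digits or \U plus 8 hex digits.
bool isUniversalCharacterName(const std::string& s) {
    static const std::regex ucn(R"(\\(u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}))");
    return std::regex_match(s, ucn);
}
```

To use it in a string pattern, the part after the backslash, u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8}, would be added as one more alternative next to the other escape sequence types.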

So, to bring this all together:

 std::regex stringMatcher ( R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")" );
 // The leading backslash shared by all escape sequence types is factored out.
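To check the combined pattern against the test text from the question, something like the following sketch could be used. The pattern is the one above, unchanged; the names countStringLiterals and sampleText are made up for this example, and the sample is the question's text pasted into a raw string literal.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// The combined pattern from the answer, unchanged.
const std::regex stringMatcher(
    R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")");

// Hypothetical helper: counts the non-overlapping string literals in `text`.
std::size_t countStringLiterals(const std::string& text) {
    const std::sregex_iterator first(text.begin(), text.end(), stringMatcher);
    const std::sregex_iterator last;  // default-constructed end iterator
    return static_cast<std::size_t>(std::distance(first, last));
}

// The test input from the question. The custom delimiter "txt" keeps the
// quotation marks and backslashes literal.
const std::string sampleText = R"txt("Woah. Look. A  tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed.")txt";
```

Calling countStringLiterals(sampleText) should return 8: one match for each line, two for the "These ..." line, and one for the string that spans several lines. The negated class [^\\"] also matches newline characters, which is why the multi-line string needs no special handling.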

Live demo.
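If the exact set of valid escape sequences does not matter, for example when a tokenizer only needs to find where a string ends, a looser pattern can stand in for the one above. This is a different, simpler technique than the derivation in this answer: it accepts a backslash followed by any single character, which is the "odd number of backslashes" rule from the question without look-behind. A minimal sketch; the name extractQuoted is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every quoted segment in `text`, including the
// surrounding quotation marks. A backslash always consumes the character
// after it, so an escaped quotation mark never ends the segment.
std::vector<std::string> extractQuoted(const std::string& text) {
    // [\s\S] matches any character, including a newline.
    static const std::regex looseMatcher(R"("(?:[^"\\]|\\[\s\S])*")");
    std::vector<std::string> found;
    for (std::sregex_iterator it(text.begin(), text.end(), looseMatcher), end;
         it != end; ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

For an input such as let x = "This is a \"string\" with quotes."; this yields a single element holding the whole quoted segment. Unlike the strict pattern, it also accepts sequences such as \q that are not valid C escapes.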

n. m. could be an AI