Regex for matching C++ string constant

Question

I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 letters like this "hey I'm a string. I'm currently working with this one here \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.

Test cases:

std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";

Expected output:

std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");

On the second one I instead get

std::cout << String("He said: \")bananas\"String(" << ")...";

Short repro code (using the regex by AR.3):

std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");

@LucasTrzesniewski Is there a difference? Both got \" and \\... — Niclas M, Jan 28 '17 at 11:41
Do you want to run this on a fairly limited set of C++ inputs, or is your ultimate goal to parse *any* valid C++ program? — Jongware, Jan 28 '17 at 11:41
@NiclasM `u8R"hello(this"is\a\""""single\\valid raw string literal)hello"` — Lucas Trzesniewski, Jan 28 '17 at 11:45
It's an exercise in futility to parse C++ source code with regex. You need a stateful lexer at the very least. — rustyx, Jan 28 '17 at 11:55
@RustyX I don't want to parse the whole thing. Just find Strings. — Niclas M, Jan 28 '17 at 11:56
I don't have PC handy to test so try this regex. `".*?([^\\]")`. See https://regex101.com/r/sLuToJ/1 — Saleem, Jan 28 '17 at 12:00
Also inside `/* "comments" */` (which can be nested), multiline and raw strings, skipping things like `'"'` etc.? — rustyx, Jan 28 '17 at 12:03
@NiclasM Glad it worked for you. I'm heading to work now and will update my answer once there. Probably in an hour. — Saleem, Jan 28 '17 at 12:10
I think you are going to find that getting this right is the *easist* part of building a preprocessor. And if you don't get the preprocessor exactly right, you won't be able to process real C++ programs. What is your actual goal? (You could just use Clang's or GCC's). — Ira Baxter, Jan 28 '17 at 18:18
*"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."* -- Jamie Zawinski — G. Sliepen, Jan 28 '17 at 21:14
You may want to check [my post here](http://stackoverflow.com/q/36454069/2064981). It deals with detecting comments to set them apart from strings and may help you on your way. — SamWhan, Jan 29 '17 at 00:10
@IraBaxter Yet I was finished with everything I needed except for the strings. My ultimate goal? Writing a tool that allows me being lazy. Why I'm not using the CPP? Because its not capable of string manipulation. — Niclas M, Jan 29 '17 at 14:58
You didn't describe your ultimate goal. Surely, building a preprocessor isn't it. — Ira Baxter, Jan 29 '17 at 16:22
@IraBaxter `In computer science, a preprocessor is a program that processes its input data to produce output that is used as input to another program.` (Wikipedia). That's exactly what I want. — Niclas M, Jan 29 '17 at 18:16
(??Yes, I know what a preprocessor is, including specifically what a C++ preprocessor is). Usually people dont build a preprocessor for C++ without some idea of what they are going to do with it. — Ira Baxter, Jan 30 '17 at 14:34
@IraBaxter Yeah, and I need a preprocessor that puts that stupid paren's around strings. — Niclas M, Jan 30 '17 at 14:37

Lucas Trzesniewski · Answer 1 · 2017-01-29T00:59:15.517

Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex. Let's use PCRE (or boost::regex) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.

We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:

Singleline comments
Multiline comments
Character literals
String literals

We'll be using the extended (x) option, which ignores whitespace in the pattern.

Comments

Here's what [lex.comment] says:

The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]

# singleline comment
// .* (*SKIP)(*FAIL)

# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

Easy peasy. If you match anything there, just (*SKIP)(*FAIL) - meaning that you throw away the match. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, meaning it's allowed to match newlines.

Character literals

Here's the grammar from [lex.ccon]:

 character-literal:  
    encoding-prefix(opt) ’ c-char-sequence ’
  encoding-prefix:
    one of u8 u U L
  c-char-sequence:
    c-char
    c-char-sequence c-char
  c-char:
    any member of the source character set except the single-quote ’, backslash \, or new-line character
    escape-sequence
    universal-character-name
  escape-sequence:
    simple-escape-sequence
    octal-escape-sequence
    hexadecimal-escape-sequence
  simple-escape-sequence: one of \’ \" \? \\ \a \b \f \n \r \t \v
  octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit
  hexadecimal-escape-sequence:
    \x hexadecimal-digit
    hexadecimal-escape-sequence hexadecimal-digit

Let's define a few things first, which we'll need later on:

(?(DEFINE)
  (?<prefix> (?:u8?|U|L)? )
  (?<escape> \\ (?:
    ['"?\\abfnrtv]         # simple escape
    | [0-7]{1,3}           # octal escape
    | x [0-9a-fA-F]{1,2}   # hex escape
    | u [0-9a-fA-F]{4}     # universal character name
    | U [0-9a-fA-F]{8}     # universal character name
  ))
)

prefix is defined as an optional u8, u, U or L
escape is defined as per the standard, except that I've merged universal-character-name into it for the sake of simplicity

Once we have these, a character literal is pretty simple:

(?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

We throw it away with (*SKIP)(*FAIL)

Simple strings

They're defined in almost the same way as character literals. Here's a part of [lex.string]:

  string-literal:
    encoding-prefix(opt) " s-char-sequence(opt) "
    encoding-prefix(opt) R raw-string
  s-char-sequence:
    s-char
    s-char-sequence s-char
  s-char:
    any member of the source character set except the double-quote ", backslash \, or new-line character
    escape-sequence
    universal-character-name

This will mirror the character literals:

(?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

The differences are:

The character sequence is optional this time (* instead of +)
The double quote is disallowed when unescaped instead of the single quote
We actually don't throw it away :)

Raw strings

Here's the raw string part:

  raw-string:
    " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
  r-char-sequence:
    r-char
    r-char-sequence r-char
  r-char:
    any member of the source character set, except a right parenthesis )
    followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
  d-char-sequence:
    d-char
    d-char-sequence d-char
  d-char:
    any member of the basic source character set except:
    space, the left parenthesis (, the right parenthesis ), the backslash \,
    and the control characters representing horizontal tab,
    vertical tab, form feed, and newline.

The regex for this is:

(?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "

[^ ()\\\t\x0B\r\n]* is the set of characters that are allowed in delimiters (d-char)
\k<delimiter> refers to the previously matched delimiter

The full pattern

The full pattern is:

(?(DEFINE)
  (?<prefix> (?:u8?|U|L)? )
  (?<escape> \\ (?:
    ['"?\\abfnrtv]         # simple escape
    | [0-7]{1,3}           # octal escape
    | x [0-9a-fA-F]{1,2}   # hex escape
    | u [0-9a-fA-F]{4}     # universal character name
    | U [0-9a-fA-F]{8}     # universal character name
  ))
)

# singleline comment
// .* (*SKIP)(*FAIL)

# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

# character literal
| (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

# standard string
| (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "

# raw string
| (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "

See the demo here.

`boost::regex`

Here's a simple demo program using boost::regex:

#include <string>
#include <iostream>
#include <boost/regex.hpp>

static void test()
{
    boost::regex re(R"regex(
        (?(DEFINE)
          (?<prefix> (?:u8?|U|L) )
          (?<escape> \\ (?:
            ['"?\\abfnrtv]         # simple escape
            | [0-7]{1,3}           # octal escape
            | x [0-9a-fA-F]{1,2}   # hex escape
            | u [0-9a-fA-F]{4}     # universal character name
            | U [0-9a-fA-F]{8}     # universal character name
          ))
        )

        # singleline comment
        // .* (*SKIP)(*FAIL)

        # multiline comment
        | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

        # character literal
        | (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)

        # standard string
        | (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "

        # raw string
        | (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
    )regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);

    std::string subject(R"subject(
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";

"" // empty string
'"' // character literal

// this is "a string literal" in a comment
/* this is
   "also inside"
   //a comment */

// and this /*
"is not in a comment"
// */

"this is a /* string */ with nested // comments"
    )subject");

    std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
}

int main(int argc, char **argv)
{
    try
    {
        test();
    }
    catch(std::exception ex)
    {
        std::cerr << ex.what() << std::endl;
    }

    return 0;
}

(I left syntax highlighting disabled because it goes nuts on this code)

For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)?) to make the pattern work. I believe it's a bug in boost::regex, as both PCRE and Perl work just fine with the original pattern.

What if we don't have a fancy regex engine at hand?

Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.

A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?>...) with normal groups (?:...) if we don't nest quantifiers in order to avoid catastrophic backtracking.

We can also avoid (*SKIP)(*FAIL) if we add one line of logic into the replacement function: All the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.

All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex becomes this monstrosity once converted:

(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"

And here's an interactive demo you can play with:

function run() {
    var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
    
    var input = document.getElementById("input").value;
    var output = input.replace(re, function(m, ignore) {
        return ignore ? m : "String(" + m + ")";
    });
    document.getElementById("output").innerText = output;
}

document.getElementById("input").addEventListener("input", run);
run();

<h2>Input:</h2>
<textarea id="input" style="width: 100%; height: 50px;">
std::cout << L"hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";
std::cout << u8R"hello(this"is\a\""""single\\(valid)"
raw string literal)hello";

"" // empty string
'"' // character literal

// this is "a string literal" in a comment
/* this is
   "also inside"
   //a comment */

// and this /*
"is not in a comment"
// */

"this is a /* string */ with nested // comments"
</textarea>
<h2>Output:</h2>
<pre id="output"></pre>

I've opened a boost bug report [here](https://svn.boost.org/trac/boost/ticket/12797). — Lucas Trzesniewski, Jan 29 '17 at 14:50

score 3 · Answer 2 · answered Jan 28 '17 at 16:12

Regular expressions can be tricky for beginners but once you understand it's basics and well tested divide and conquer strategy, it will be your goto tool.

What you need to search for quote (") not starting with () back slash and read all characters upto next quote.

The regex I came up is (".*?[^\\]"). See a code snippet below.

std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";

std::regex re(R"((".*?[^\\]"))");
in_line = std::regex_replace(in_line, re, "String($1)");

std::cout << in_line << endl;

Output:

std::cout << String("He said: \"bananas\"") << String("...");

Regex Explanation:

(".*?[^\\]")

Options: Case sensitive; Numbered capture; Allow zero-length matches; Regex syntax only

Match the regex below and capture its match into backreference number 1 (".*?[^\\]")
- Match the character “"” literally "
- Match any single character that is NOT a line break character (line feed, carriage return) .*?
  - Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
- Match any character that is NOT the backslash character [^\\]
- Match the character “"” literally "

String($1)

Insert the character string “String” literally String
Insert an opening parenthesis (
Insert the text that was last matched by capturing group number 1 $1
Insert a closing parenthesis )

This solution does not match empty Strings (""). Small alteration finds also those: (".*?[^\\\]?") Explanation: [^\\\] demands at least one character that is not backslash. Adding ? after this part makes it optional. — Bojan Hrnkas, Jul 24 '20 at 07:42

Roland Illig · Answer 3 · 2017-01-28T12:13:46.657

2

Read the relevant sections from the C++ standard, they are called lex.ccon and lex.string.

Then convert each rule you find there into a regular expression (if you really want to use regular expressions; it might turn out that they are not capable of doing this job).

Then, build more complicated regular expressions out of them. Be sure to name your regular expressions exactly as the rules from the C++ standard, so that you can recheck them later.

If, instead of using regular expressions, you want to use an existing tool, here is one: http://clang.llvm.org/doxygen/Lexer_8cpp_source.html. Have a look at the LexStringLiteral function.

edited Jan 28 '17 at 12:13

answered Jan 28 '17 at 11:57

Roland Illig

40,703
10
88
121

I don't know about you, but I ain't got time for that. What would be alternatives for regexes? – Niclas M Jan 28 '17 at 11:59
Use an existing library that parses C++ string literals. It could be in the source code of a compiler, or a source code formatter, or an IDE that has syntax highlighting. All these programs have already solved this problem. When looking through them, start with those that you trust the most for handling all edge cases properly. – Roland Illig Jan 28 '17 at 12:06
That's what I first thought of but then I wanted to learn regexes some bit more. – Niclas M Jan 28 '17 at 12:08
1

You've just learned the most important thing about regular expressions: they cannot solve all tasks. Oh, no, wait. You haven't learned that yet. To actually learn it, try to parse C++ string literals. Even if you fail, you will know much more about regular expressions afterwards. – Roland Illig Jan 28 '17 at 12:10
1

@RolandIllig sure regexes can't solve all tasks, but I'm getting tired of people saying that regexes are not appropriate even for tasks they were *designed* for. C++ literals are a Chomsky Type-3 language, they are regular and regexes are just fine for this. I tried to follow the standard in my answer just to show it's no problem at all. – Lucas Trzesniewski Jan 28 '17 at 13:33
2

@LucasTrzesniewski C++ raw string literals are _not_ Chomsky Type 3 because of the backreference `\2`. Therefore traditional regular expressions cannot match them. Enhanced regular expressions (like in Perl, JavaScript) _can_ match them, though, but these are not _regular_ anymore. – Roland Illig Jan 29 '17 at 01:38