13

Say you have a string which is provided by the user. It can contain any kind of character. Examples are:

std::string s1{"hello world"};
std::string s1{".*"};
std::string s1{"*{}97(}{.}}\\testing___just a --%#$%# literal%$#%^"};
...

Now I want to search in some text for occurrences of >> followed by the input string s1 followed by <<. For this, I have the following code:

std::string input; // the input text
std::regex regex{">> " + s1 + " <<"};

if (std::regex_search(input, regex)) {
     // add logic here
}

This works fine if s1 does not contain any special characters. However, if s1 contains characters that the regex engine treats as special, it doesn't work.

How can I escape s1 such that std::regex considers it as a literal, and therefore does not interpret s1? In other words, the regex should be:

std::regex regex{">> " + ESCAPE(s1) + " <<"};

Is there a function like ESCAPE() in std?

Important: I simplified my question. In my real case, the regex is much more complex. As I am only having trouble with the fact that s1 is interpreted, I left these details out.

Karel Demeester
  • 2
    Is there a particular reason for using regular expressions? This can be done with `string.find` – Austin Brunkhorst Oct 22 '16 at 18:11
  • Why are you using _the input string_ (`s1`) as a regex?? Perhaps something like `std::regex Regex{">>*<<"};` would be better? – ForceBru Oct 22 '16 at 18:12
  • @Austin Brunkhorst Yes, I need regular expressions in my case. I simplified the question, as I am only struggling with the fact that the string is interpreted by the engine. In my real case, the regex is more complex. – Karel Demeester Oct 22 '16 at 18:12
  • @ForceBru There are two inputs: `s1` and some text. The string `s1` is used to dynamically construct the regex, which is used to search in the input text. – Karel Demeester Oct 22 '16 at 18:14
  • I would be tempted to construct a regex to extract the testable portion and then directly compare the string: `std::regex regex{">> (.*?) <<"}; ... if(match.str(1) == s1)...` could that work in your situation? – Galik Oct 22 '16 at 18:40
  • 1
    [This is related to "How to escape a string for use in Boost Regex"](http://stackoverflow.com/questions/1252992/how-to-escape-a-string-for-use-in-boost-regex) since std::regex was largely based on boost::regex. You might consult that question for an answer. – Cornstalks Oct 22 '16 at 18:40
  • It's a shame that C++ doesn't have a dedicated function for quoting/escaping a literal string into regexp (like [`Pattern.quote` in Java](https://docs.oracle.com/en/java/javase/12/docs/api/java.base/java/util/regex/Pattern.html#quote(java.lang.String)) or [`Regex.Escape` in .NET](https://learn.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.escape)). – Sasha Jan 17 '22 at 14:49
  • @Wiktor I don't think this is a duplicate, since the other question is using boost but this one isn't. – Donald Duck May 05 '22 at 13:28
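
Galik's suggestion in the comments above (capture whatever sits between the markers, then compare it to `s1` as a plain string) could be sketched roughly as follows. The helper name `contains_marked_literal` is hypothetical, and the sketch assumes that `s1` itself does not contain ` <<` or a newline, since the lazy `.*?` stops at the first ` <<` and `.` does not match line terminators in the default ECMAScript grammar.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if `text` contains ">> <s1> <<".
// s1 never becomes part of the pattern, so nothing needs to be escaped.
bool contains_marked_literal(const std::string& text, const std::string& s1) {
    static const std::regex marker{">> (.*?) <<"};
    // Walk over every non-overlapping ">> ... <<" occurrence in the text.
    for (std::sregex_iterator it{text.begin(), text.end(), marker}, end; it != end; ++it) {
        // Compare the captured payload to s1 as an ordinary string.
        if (it->str(1) == s1) {
            return true;
        }
    }
    return false;
}
```

This only helps when the literal part can be isolated by a capture group; if `s1` has to be embedded in the middle of a larger expression, it still needs escaping.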

1 Answer

10

You will have to escape all special characters in the string with \. The most straightforward approach would be to use another expression to sanitize the input string before creating the expression regex.

// matches any characters that need to be escaped in RegEx
std::regex specialChars { R"([-[\]{}()*+?.,\^$|#\s])" };

std::string input = ">> "+ s1 +" <<"; 
std::string sanitized = std::regex_replace( input, specialChars, R"(\$&)" );

// "sanitized" can now safely be used in another expression
Austin Brunkhorst
  • Do you really need to escape `^` here? And are you including all whitespace to deal with newlines or something? A little explanation would be useful. Also, don't you need to escape `\` too? – Cornstalks Oct 22 '16 at 18:42
  • `^` is matched for the sake of completeness - obviously it will never match the start of a line with the preceding `>> `, but OP said the example is simplified. Can you elaborate on what you mean about whitespace and newlines? – Austin Brunkhorst Oct 22 '16 at 18:47
  • 2
    Including `^` makes sense, but you've escaped it with a backslash. I'm curious why you've escaped it with a backslash in this situation. Also, you've included `\s`, which matches whitespace, but I'm not sure why you'd need that (maybe newline handling? I dunno; I can't remember how std::regex handles newlines and whether escaping them or not makes a difference). And in my previous comment I tried to say ``\`` should be included in `specialChars` too, but Markdown ate it. – Cornstalks Oct 22 '16 at 18:51
  • 2
    Also about `#`, should that be escaped? – Karel Demeester Oct 22 '16 at 19:11
  • 2
    And `,`? Why should that be escaped? – Karel Demeester Oct 22 '16 at 19:12
  • You do not need to escape `,`, nor `#`, nor whitespace. – Wiktor Stribiżew Oct 25 '16 at 13:43
  • 1
    Why don't you escape backslash? – user541686 May 16 '18 at 21:29
  • 3
    All one needs to escape with ` \ ` are these characters: `[\^$.|?*+(){}`, as per [regular-expressions.info](https://www.regular-expressions.info/refcharacters.html).
  • 1
    @c00000fd this list is insufficient for e.g. `[({])}` input
  • `$&` is a "replacement" (placeholder for something to be substituted in) for the full match (all the matched text) for those wondering – Løiten Sep 16 '21 at 14:02
  • Yep, none of these character lists (neither Austin Brunkhorst's nor c00000fd's) is correct. – Sasha Jan 17 '22 at 14:46
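
Since the comments disagree about which characters belong in the list, here is a minimal sketch of an escaping helper that avoids a second regex altogether. It assumes the default ECMAScript grammar of `std::regex` and escapes the characters `^ $ \ . * + ? ( ) [ ] { } |`, which includes the backslash and the closing bracket that the comments above point out as missing. The names `escape_regex` and `contains_marked` are hypothetical; nothing like them exists in the standard library, and other grammars (for example `std::regex::basic` or `std::regex::extended`) would need a different character set.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the standard library): puts a backslash
// in front of every ECMAScript syntax character so the result matches
// `literal` verbatim when used inside a std::regex pattern.
std::string escape_regex(const std::string& literal) {
    // Assumed set of characters that are special outside a character class.
    static const std::string special = R"(^$\.*+?()[]{}|)";
    std::string out;
    out.reserve(literal.size() * 2);
    for (char c : literal) {
        if (special.find(c) != std::string::npos) {
            out += '\\';
        }
        out += c;
    }
    return out;
}

// Hypothetical helper: searches `text` for ">> <s1> <<" with s1 taken literally.
bool contains_marked(const std::string& text, const std::string& s1) {
    const std::regex re{">> " + escape_regex(s1) + " <<"};
    return std::regex_search(text, re);
}
```

With this, the pattern from the question would be built as `std::regex regex{">> " + escape_regex(s1) + " <<"};`. The escaped string is only meant for use outside a bracket expression; inside `[...]` other characters, such as `-`, become significant.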