C++11 has 6 different regular expression grammars you can use. In my case, I am interacting with a component that is using modified ECMAScript regular expressions.

I need to create a regular expression "match a string starting with X", where X is a string literal I have.

So the regular expression I want is roughly `^X.*`. Except the string X could itself contain regular expression special characters, and I want those matched literally rather than interpreted.

Which means I really want `^ escaped(X) .*`.
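To make the problem concrete, here is a minimal sketch of what goes wrong without escaping. The helper names `naive_prefix_regex` and `naive_starts_with` are made up for illustration, and the ECMAScript grammar is assumed:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: builds "^X.*" by plain concatenation, WITHOUT escaping X.
std::regex naive_prefix_regex(const std::string& x)
{
    return std::regex("^" + x + ".*", std::regex::ECMAScript);
}

// Returns true if `text` matches the naively built pattern in full.
bool naive_starts_with(const std::string& text, const std::string& x)
{
    return std::regex_match(text, naive_prefix_regex(x));
}
```

With X = `"a.b"`, the text `"a.b and more"` matches as intended, but so does `"axb and more"`, because the unescaped `.` is treated as "any character".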

Now, I can read over the ECMAScript documentation, find all of the characters which have a special meaning, write a function that escapes them, and be done. But this seems inelegant, inefficient, and error prone -- especially if I want to support all 6 kinds of regular expressions that C++ supports currently, let alone in the future.

Is there a simple way in the standard to escape a literal string to embed in a C++ regular expression, possibly as a function of the regular expression grammar, or do I have to roll my own?

Here is a similar question using the boost library, where the list of escapes is hard-coded, and then a regular expression is generated that backslashes them. Am I reduced to adapting that answer for use in std?

Yakk - Adam Nevraumont
  • The answer in [How to escape a string for use in Boost Regex](http://stackoverflow.com/questions/1252992/how-to-escape-a-string-for-use-in-boost-regex) is actually what you need. – Wiktor Stribiżew Aug 27 '15 at 14:08
  • Why do you need to escape it. If X is a string, can't you create your regex from just concatenating, like `"^" + X + ".*"`. – SamWhan Aug 27 '15 at 14:08
  • @stribizhev which means writing a custom version of it for each of the 6 regular expression formats and any new format that comes along. – Yakk - Adam Nevraumont Aug 27 '15 at 14:08
  • @GlasG Because `X` could be `"this.is.a.*******.problem"` -- and the `.` and `*` in that string should not be interpreted as regular expression commands. – Yakk - Adam Nevraumont Aug 27 '15 at 14:09
  • Yes, exactly. Boost regex won't do it for you. In .NET (e.g. C#, VB.NET), it is clear: use `Regex.Escape`. In C++, there is no such a function. – Wiktor Stribiżew Aug 27 '15 at 14:09
  • @Yakk Get it. Shooting from the hip here, but couldn't you do a simple replace of all non-alphanumeric character with their escaped form? Something like `regex_replace(X, "([^\\w\\d])", "\\\1")`, concatenate (as per my previous comment) then do your search. – SamWhan Aug 27 '15 at 14:22
  • @ClasG is `"\#"` interpreted as a `"#"` in every regular expression engine (at least the 6 supported by C++), or is it an error in some regular expression engine? I don't know personally. – Yakk - Adam Nevraumont Aug 27 '15 at 14:26
  • In [POSIX BRE](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html) `(` is not special, but `\(` is. So escape-everything-that looks-suspicious is not a good idea. – ex-bart Aug 27 '15 at 14:27
  • @Yakk As far as I know, yes. But instead of "quoting" non-alnum, put everything but `]` and `^` inside `[]` - `regex_replace(X, "([^\\]^])", "[\\1]")` – SamWhan Aug 27 '15 at 14:28
  • @Yakk: `#` is not a good example since it can be used to write comments. I think amber's answer is the best thing you can do (don't forget to add `#`) – Casimir et Hippolyte Aug 27 '15 at 14:28
  • @Yakk And... You'd have to quote `^` - `\\^` ;) – SamWhan Aug 27 '15 at 14:30
  • Boost docs say `three main syntax options` Perl (this is the default behavior), POSIX extended, POSIX Basic. Given there is only 3, I'd use Perl. With it you can embed `\Q .. \E` around your text. From the docs: `\Q begins a "quoted sequence": all the subsequent characters are treated as literals, until either the end of the regular expression or \E` –  Aug 27 '15 at 16:51
  • @Sln This is a C++11 question, not a boost question. Using boost regex to solve a C++11 regex problem seems silly. Finally, you still have to escape `\E`s in a more complex way with that option -- the string `X` I'm getting is provided by not-me, so I cannot sanitize it. – Yakk - Adam Nevraumont Aug 27 '15 at 17:15
  • @sln and if `str = "\\E.*\\Q"`, it looks to the perl RE engine that you ended the quoted string, put a `.*` in, then started it again. One would have to escape `\E` within `str` for your plan to work (probably by replacing it with `"\\E\\\\E\\Q"` or somesuch), *and* it only works for *one* kind of RE which is not the one I'm using. Finally, C++11 doesn't have perl based RE. I'm not "don't know for sure": I'm explaining why your comment isn't a viable approach. – Yakk - Adam Nevraumont Aug 27 '15 at 17:52
  • @Yakk - My mistake buddy, I thought C++11 used the boost::regex perl flavor. Unfortunately, you won't get lookbehinds with ECMAScript. I've compiled exe's with the boost regex source. It's about 20 .cpp files. Adds about 300k overhead, but you don't need the boost lib to run it when you do it this way. –  Aug 27 '15 at 21:25

2 Answers

If you have to write your own, there are only two kinds you need to know: BRE and the rest.

The regexes below should work. Use an ECMAScript-type regex to operate on the input string.

They are built from the special characters listed in "What special characters must be escaped in regular expressions?", under the answer section "Legacy RegEx Flavors (BRE/ERE)".

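Both use the same replacement: "\\\\$1" (written as a C++ string literal, for engines whose replacement syntax treats backslash as an escape character; the default format of std::regex_replace does not, so there "\\$1" already yields a single backslash).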

For BRE input:

 # "(\\\\[+?(){}|]|[.^$*\\[\\]\\\\-])"


 (                             # (1 start)
      \\ [+?(){}|]             # not sure this is needed (it's not needed)
   |  
      [.^$*\[\]\\-] 
 )                             # (1 end)

For ERE or ECMAScript input:

 # "([.^$*+?()\\[\\]{}\\\\|-])"

 ( [.^$*+?()\[\]{}\\|-] )      # (1)
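As a rough sketch of how the ERE/ECMAScript pattern above could be applied with `std::regex`: the function name `escape_ere_or_ecmascript` is hypothetical, and `$&` (the whole match) is used in the replacement instead of a capture group, assuming the default ECMAScript format rules of `std::regex_replace`.

```cpp
#include <regex>
#include <string>

// Sketch: backslash-escape every character that is special in ERE/ECMAScript.
// The function name is hypothetical; the character class is the one listed above.
std::string escape_ere_or_ecmascript(const std::string& literal)
{
    // The escaping pattern itself is compiled as ECMAScript (the std::regex default).
    static const std::regex special(R"([.^$*+?()\[\]{}\\|-])");
    // In the default replacement format only '$' is special, so "\\$&"
    // emits one backslash followed by the matched character.
    return std::regex_replace(literal, special, "\\$&");
}
```

For example, `a.b*c` becomes `a\.b\*c`, and a string without special characters is returned unchanged.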

BRE input example:

Before -

+_)(*&^%$#@!asdfasfd hello
+ ? ( ) { } |
\+ \? \( \) \{ \} \|
\\+ \\? \\( \\) \\{ \\} \\|
}{":][';/.,<>? 
here is

After -

+_)(\*&\^%\$#@!asdfasfd hello
+ ? ( ) { } |
\\+ \\? \\( \\) \\{ \\} \\|
\\\\+ \\\\? \\\\( \\\\) \\\\{ \\\\} \\\\|
}{":\]\[';/\.,<>? 
here is
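
The BRE transformation shown in the example can be sketched with `std::regex` as follows. The name `escape_bre` is made up for illustration, and only the second alternative of the BRE pattern is used, since the first is noted above as unnecessary:

```cpp
#include <regex>
#include <string>

// Sketch (hypothetical name): escape a literal for use inside a POSIX BRE pattern.
std::string escape_bre(const std::string& literal)
{
    // In BRE, + ? ( ) { } | are ordinary unless preceded by a backslash, so
    // escaping the backslash itself is enough to neutralise sequences like "\(".
    static const std::regex special(R"([.^$*\[\]\\-])");
    // "$&" is the whole match, so each special character gets one backslash.
    return std::regex_replace(literal, special, "\\$&");
}
```

Applied to the first "Before" line this should give the first "After" line, and `\+ \?` becomes `\\+ \\?`.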

(answering quite a while later, so probably OP has worked something out, but still).

A preliminary comment: the regular expression you want, in ECMAScript (and many other) syntaxes, is ^X. You don't need the trailing .* as long as the component searches for a match rather than requiring the whole string to match.

As for the approach to this task: you're asking for a general solution for all regex syntax options. Well, YAGNI: You Ain't Gonna Need It. Unless you're writing a general-purpose library that is supposed to support all C++ regex syntaxes, don't try to solve the whole world's problems yourself and right away. This is further emphasized by the fact that the set of syntax options keeps changing: since you wrote your question, C++17 has added a multiline option on top of the six grammars and the existing flags. See here.

So I suggest you write something that is potentially extensible to other syntax options, but only actually works - for now - with the syntax option(s) you need. e.g.:

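// Note: the flag type lives in std::regex_constants (also available as std::regex::flag_type).
template <std::regex_constants::syntax_option_type SyntaxOption>
std::string escape_for_regex(const std::string_view sv);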

or perhaps

template <std::regex_constants::syntax_option_type SyntaxOption>
std::string_view
escape_for_regex(
    const std::string_view source, 
    std::string_view destination
);

in which the returned string_view indicates how much of the destination you're actually using. One can bike-shed about the signature some more (e.g. perhaps use iterators? ranges?)

and you'll specialize this for std::regex::ECMAScript. The implementation is provided in this SO question:

Is there a RegExp.escape function in JavaScript?

with the answer being that there isn't, but you could add it like so (in JavaScript, mind you):

RegExp.escape = function(s) {
    return s.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
};

Moving that to C++, with our first option for the function signature, this becomes:

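template <>
std::string escape_for_regex<std::regex::ECMAScript>(const std::string_view sv)
{
    // Character class of everything that is special in ECMAScript (plus '/').
    const std::regex to_escape("[-/\\\\^$*+?.()|[\\]{}]");
    // "$&" is the whole match; the pattern has no capture group, so "$1" would be empty.
    const std::string escaped("\\$&");
    const std::string s{sv};
    return std::regex_replace(s, to_escape, escaped);
}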

Caveat: Haven't properly tested this. I also don't like the extra string construction, so probably another one of the regex_replace variants might be usable.
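
One way to avoid the extra string construction might be the iterator overload of `std::regex_replace`, which writes through an output iterator. The following is only a sketch under that assumption: the names `escape_ecmascript` and `starts_with_literal` are hypothetical, and the second function just illustrates how the escaped literal could be plugged into the `^X` pattern.

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <string_view>

// Sketch (hypothetical name): ECMAScript escaping without first copying the
// string_view into a std::string, using the iterator overload of regex_replace.
std::string escape_ecmascript(std::string_view sv)
{
    static const std::regex to_escape(R"([-/\\^$*+?.()|[\]{}])");
    std::string out;
    out.reserve(sv.size() * 2);  // worst case: every character gets a backslash
    std::regex_replace(std::back_inserter(out), sv.begin(), sv.end(),
                       to_escape, "\\$&");
    return out;
}

// True if `text` begins with the literal `prefix`, via the regex ^escaped(prefix).
bool starts_with_literal(const std::string& text, std::string_view prefix)
{
    const std::regex re("^" + escape_ecmascript(prefix));
    return std::regex_search(text, re);
}
```

With the prefix `a.b*c`, the text `a.b*c and more` is accepted while `axbbc and more` is rejected, even though the latter would match the unescaped pattern `^a.b*c`.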

einpoklum