29

I'm just getting my head around regular expressions, and I'm using the Boost Regex library.

I have a need to use a regex that includes a specific URL, and it chokes because obviously there are characters in the URL that are reserved for regex and need to be escaped.

Is there any function or method in the Boost library to escape a string for this kind of usage? I know there are such methods in most other regex implementations, but I don't see one in Boost.

Alternatively, is there a list of all characters that would need to be escaped?

Gerald

4 Answers

40
. ^ $ | ( ) [ ] { } * + ? \

Ironically, you could use a regex to escape your URL so that it can be inserted into a regex.

const boost::regex esc("[.^$|()\\[\\]{}*+?\\\\]");
const std::string rep("\\\\&");
std::string result = regex_replace(url_to_escape, esc, rep,
                                   boost::match_default | boost::format_sed);

(The boost::format_sed flag selects sed's replacement string format, in which an unescaped & outputs whatever was matched by the whole expression.)

Or, if you are not comfortable with sed's replacement string format, change the flag to boost::format_perl and use the familiar $& to refer to whatever was matched by the whole expression.

const std::string rep("\\\\$&");
std::string result = regex_replace(url_to_escape, esc, rep,
                                   boost::match_default | boost::format_perl);
LogicStuff
Amber
  • I tried using a regex to do it, but I'm still fairly incompetent, and strange things were occurring :p I've ordered a couple of books on regex today so hopefully my ignorance will be short-lived! In the meantime, using a regular string replacement to escape these characters worked for my immediate needs, thanks. – Gerald Aug 10 '09 at 03:43
  • I added some code to my answer that I *think* should work to add a backslash before any character that needs to be escaped. I haven't used boost in a while though so no guarantees. – Amber Aug 10 '09 at 03:50
  • It was close, just had to add a "&" to the end of rep and it worked. Thanks. – Gerald Aug 10 '09 at 04:00
  • Btw, since C++11 we can also use std::regex. Unfortunately, GCC 4.8 has many regex bugs, and even with GCC 7 the sed-format expression does not work correctly. This was fixed for GCC 8: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83601 – dhaumann Sep 17 '18 at 12:05
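
Following up on the std::regex comment, here is a minimal sketch of the same escaping done with the standard library. It uses the default ECMAScript replacement format ($& for the whole match) rather than the sed format, so it does not depend on the sed behaviour discussed in that comment. The names regex_escape_std and starts_with_url are hypothetical, and the code assumes a standard library with a working <regex>.

```cpp
#include <regex>
#include <string>

// Hypothetical std::regex counterpart of the Boost snippet in the answer.
std::string regex_escape_std(const std::string& text) {
    // Same character class as the Boost version, written as a raw string
    // literal so the backslashes do not have to be doubled.
    static const std::regex special(R"([.^$|()\[\]{}*+?\\])");
    // In the default (ECMAScript) format, "$&" inserts the whole match and
    // the backslash in front of it is copied through literally.
    return std::regex_replace(text, special, R"(\$&)");
}

// Hypothetical usage: is `candidate` the literal URL `base`, optionally
// followed by a path?
bool starts_with_url(const std::string& candidate, const std::string& base) {
    const std::regex re("^" + regex_escape_std(base) + "(/.*)?$");
    return std::regex_match(candidate, re);
}
```

Because the dot in the base URL is escaped, starts_with_url("http://exampleXcom/a", "http://example.com") is false, while "http://example.com/a" matches.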
14

Using the code from Dav (plus the fix from the comments), I created an ASCII/Unicode function, regex_escape():

std::wstring regex_escape(const std::wstring& string_to_escape) {
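    // Wide literals (L"...") match std::wstring/boost::wregex; _T() only expands to a wide literal when UNICODE is defined.
    static const boost::wregex re_boostRegexEscape( L"[.^$|()\\[\\]{}*+?\\\\]" );
    const std::wstring rep( L"\\\\&" );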
    std::wstring result = regex_replace(string_to_escape, re_boostRegexEscape, rep, boost::match_default | boost::format_sed);
    return result;
}

For an ASCII version, use std::string/boost::regex with narrow string literals instead of std::wstring/boost::wregex.

nhahtdh
Nishi
4

Same with boost::xpressive:

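// Curly braces are included as well, since they appear in the list of characters that need escaping.
const boost::xpressive::sregex re_escape_text = boost::xpressive::sregex::compile("([\\^\\.\\$\\|\\(\\)\\[\\]\\{\\}\\*\\+\\?\\/\\\\])");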

std::string regex_escape(std::string text){
    text = boost::xpressive::regex_replace( text, re_escape_text, std::string("\\$1") );
    return text;
}
Taryn
Roman
1

In C++11, you can use raw string literals to avoid doubling the backslashes in the regex string (the regex metacharacters themselves still need to be escaped):

std::string myRegex = R"(something\.com)";

See http://en.cppreference.com/w/cpp/language/string_literal, item (6).
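
To make the difference concrete, here is a small sketch using std::regex (an assumption on my part; the same literals work for a Boost regex). The helper name is_something_com is hypothetical. A raw literal only removes the C++ level of escaping, so the \. is still needed for the dot to be taken literally.

```cpp
#include <regex>
#include <string>

// Both functions return the same characters: something\.com
std::string pattern_with_doubled_backslash() { return "something\\.com"; }
std::string pattern_as_raw_literal() { return R"(something\.com)"; }

// Hypothetical helper: true only for the exact host "something.com".
bool is_something_com(const std::string& host) {
    static const std::regex re(R"(something\.com)");
    return std::regex_match(host, re);
}
```

Here is_something_com("something.com") is true and is_something_com("somethingXcom") is false, because the dot is still escaped for the regex engine.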

Emile Cormier