I'm using std::regex_replace in a C++ Windows project (Visual Studio 2010). The code looks like this:

std::string str("http://www.wikipedia.org/");
std::regex fromRegex("http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::string fmt("https://$1wik$2.org/");
std::string result = std::regex_replace(str, fromRegex, fmt);

I would expect result to be "https://www.wikipedia.org/", but I get "https://www.wikipedia.wikipedia.org/".
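
For reference, here is a minimal sketch that wraps the snippet above in a function so it can be run on different standard library implementations. The name rewrite_to_https is hypothetical and not part of the original code. With the input above, a conforming implementation should return "https://www.wikipedia.org/". The observed output is what you would get if $1 expanded to "www.wikipedia." instead of "www.", which would be consistent with the first capture group keeping a value from a discarded backtracking attempt. That is an inference from the output, not a confirmed diagnosis.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): applies the question's
// pattern and format string to a single URL.
std::string rewrite_to_https(const std::string& url) {
    // [^@:/]+ is greedy: it first consumes "www.wikipedia.org", then the
    // engine has to backtrack until "\." is followed by "wik".
    static const std::regex from_regex(
        "http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/",
        std::regex_constants::icase);
    // $1 = optional subdomain including its trailing dot,
    // $2 = "ipedia" or "imedia".
    return std::regex_replace(url, from_regex, "https://$1wik$2.org/");
}
```

Calling rewrite_to_https("http://www.wikipedia.org/") and comparing the result with the expected string makes it easy to tell whether a given compiler is affected.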

A quick check with sed gives me the expected result:

$ cat > test.txt
http://www.wikipedia.org/
$ sed 's/http:\/\/([^@:\/]+\.)?wik(ipedia|imedia)\.org\//https:\/\/$1wik$2.org\//' test.txt
http://www.wikipedia.org/

I don't get where the difference comes from. I checked the flags that can be used with std::regex_replace, but I didn't see one that would help in this case.

Update

These variants work fine:

std::regex fromRegex("http://([^@:/]+\\.)wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://((?:[^@:/]+\\.)?)wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([a-z]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([^a]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);

but not these:

std::regex fromRegex("http://([^1-9]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([^@]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);
std::regex fromRegex("http://([^:]+\\.)?wik(ipedia|imedia)\\.org/", std::regex_constants::icase);

It makes no sense to me...
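
To compare the variants side by side on a given compiler, a small harness along these lines may help. It is only a sketch: run_variants and first_groups are hypothetical names, and the code sticks to pre-C++11 constructs (apart from the regex header) so that it has a chance of building on Visual Studio 2010. The "reported" comments repeat the results listed above; they are not guaranteed for other implementations.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical harness: applies every variant of the first capture group
// to `url` and returns the replacement results in the same order.
std::vector<std::string> run_variants(const std::string& url) {
    static const char* const first_groups[] = {
        "([^@:/]+\\.)",       // reported OK: group is not optional
        "((?:[^@:/]+\\.)?)",  // reported OK: optional part inside the capture
        "([a-z]+\\.)?",       // reported OK
        "([^a]+\\.)?",        // reported OK
        "([^1-9]+\\.)?",      // reported failing
        "([^@]+\\.)?",        // reported failing
        "([^:]+\\.)?"         // reported failing
    };
    const std::size_t count = sizeof(first_groups) / sizeof(first_groups[0]);

    std::vector<std::string> results;
    for (std::size_t i = 0; i < count; ++i) {
        // Only the first group differs; the rest of the pattern is fixed.
        const std::string pattern = std::string("http://") + first_groups[i] +
                                    "wik(ipedia|imedia)\\.org/";
        const std::regex re(pattern, std::regex_constants::icase);
        results.push_back(
            std::regex_replace(url, re, std::string("https://$1wik$2.org/")));
    }
    return results;
}
```

On a conforming implementation every entry returned for "http://www.wikipedia.org/" should be "https://www.wikipedia.org/", so any entry that differs points at the variants that trigger the problem.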

Julien
  • What compiler are you using? If it's GCC give up: `<regex>` *is not implemented*. Use boost instead. – R. Martinho Fernandes Dec 20 '12 at 18:18
  • I wasn't aware that any compiler had regex support yet. Is this a recent update to a compiler? – Joseph Mansfield Dec 20 '12 at 18:20
  • @sftrabbit MSVC's STL has had a functional regex implementation since version 2008, and LLVM/Clang's libc++ has a complete regex implementation as well. – rubenvb Dec 20 '12 at 18:46
  • @rubenvb I clearly didn't research very well. – Joseph Mansfield Dec 20 '12 at 18:52
  • @R.MartinhoFernandes, why is that? Is boost's version so different that they couldn't just plug it in and call it a day? – Mark Ransom Dec 20 '12 at 20:45
  • @Mark here's an explanation, straight from the horse's mouth: http://stackoverflow.com/a/12665408/46642 – R. Martinho Fernandes Dec 20 '12 at 20:48
  • @R.MartinhoFernandes pretty decent explanation, but it doesn't explain two things: a) why are they running behind on all the competition? and b) why don't they just plug in Boost.Regex? But you're not the one I should be asking this, obviously, though I fear Jonathan won't like me pressing the issue `;-)` – rubenvb Dec 20 '12 at 21:17

1 Answer

There's a subtle error in the regular expression. Don't forget that escape sequences in string literals are expanded by the compiler. So change

"http://([^@:/]+\.)?wik(ipedia|imedia)\.org/"

to

"http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/"

That is, replace each of the two single backslashes with a pair of backslashes.

EDIT: this doesn't seem to affect the problem, though. On the two implementations I tried (Microsoft and clang), the original problem doesn't occur, with or without the doubled backslashes. (Without them, you get compiler warnings about an invalid escape sequence, but the resulting . wildcard matches the . character in the target sequence, just as a \. would.)
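
The difference between the escaped and the unescaped dot can be sketched as follows. The helper matches and the two constants are hypothetical names used only for illustration, and the shortened pattern stands in for the full one. The unescaped version is written with a plain "." because, as noted above, that is what the regex engine ends up seeing.

```cpp
#include <regex>
#include <string>

// The regex engine sees  wikipedia\.org  : the dot must be a literal '.'.
const char* const escaped_dot = "wikipedia\\.org";
// The regex engine sees  wikipedia.org   : the dot matches any character.
const char* const wildcard_dot = "wikipedia.org";

// Hypothetical helper: true if `pattern` matches the whole of `text`.
bool matches(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}
```

Both patterns match "wikipedia.org", which is why the missing backslash goes unnoticed here. Only the wildcard version also matches something like "wikipediaXorg".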

Pete Becker
  • Or change it to use raw literals - which help with the leaning toothpick problem: R"(http://([^@:/]+\.)?wik(ipedia|imedia)\.org/)". Note the preceding R. – emsr Dec 20 '12 at 19:52
  • @emsr - sure, if you have a C++11 compiler that supports raw literals. – Pete Becker Dec 20 '12 at 19:53
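
A short sketch of the raw-literal form, assuming a C++11 compiler that supports it. A raw string literal is written R"(...)", with parentheses just inside the quotes, and backslashes inside it are not treated as escape characters. The constant names are hypothetical.

```cpp
#include <string>

// Ordinary literal: every backslash meant for the regex has to be doubled.
const std::string escaped_pattern =
    "http://([^@:/]+\\.)?wik(ipedia|imedia)\\.org/";

// Raw literal (C++11): the text between R"( and )" is taken verbatim,
// so a single backslash is enough.
const std::string raw_pattern =
    R"(http://([^@:/]+\.)?wik(ipedia|imedia)\.org/)";
```

Both constants hold exactly the same characters, so either one can be passed to the std::regex constructor.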