The following program prints ">Hut", but I expect it to print "Hut". I know that .* is greedy, but the > must be matched and it sits outside the capture group, so why does it appear in my submatch?

#include <string>
#include <regex>
#include <iostream>

using namespace std;

int main() {
        regex my_r(".*>(.*)");
        string temp(R"~(cols="64">Hut)~");
        smatch m;
        if (regex_match(temp, m, my_r)) {
                cout << m[1] << endl;
        }
}
Xu Wang
  • Note that regex support is still very incomplete in gcc, and probably in MSVC too. – Stephan Dollberg Jun 05 '12 at 06:54
  • I upgraded to g++ 4.7, but still same output. I still think this is a misunderstanding of regexes on my part. Too often have I blamed software for my own errors in the past. – Xu Wang Jun 05 '12 at 07:24
  • The regex is good. Try escaping > like **\>**; this is just a guess. Also, the initial .* isn't required; just use **>(.+)** – tuxuday Jun 05 '12 at 08:00
  • related: http://stackoverflow.com/questions/8060025/is-this-c11-regex-error-me-or-the-compiler – jfs Jun 05 '12 at 08:26
  • @tuxuday, `>` has no special meaning, but in some flavors `\>` is an end-of-word boundary. Best to leave it as it is. – Alan Moore Jun 05 '12 at 08:57

2 Answers

This is a bug in libstdc++'s regex implementation. Compare these two blocks:

#include <string>
#include <regex>
#include <boost/regex.hpp>
#include <iostream>

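int main() {
    // Keep the input in a named string: the match results store iterators
    // into it, and C++14 rejects a temporary std::string in this call.
    const std::string input{"123456789"};

    {
        using namespace std;  // std::regex (libstdc++ when built with gcc)
        regex my_r("(.*)(6)(.*)");
        smatch m;
        if (regex_match(input, m, my_r)) {
            std::cout << m.length(1) << ", "
                      << m.length(2) << ", "
                      << m.length(3) << std::endl;
        }
    }

    {
        using namespace boost;  // boost::regex, same pattern and input
        regex my_r("(.*)(6)(.*)");
        smatch m;
        if (regex_match(input, m, my_r)) {
            std::cout << m.length(1) << ", "
                      << m.length(2) << ", "
                      << m.length(3) << std::endl;
        }
    }

    return 0;
}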

If you compile with gcc, the first one (libstdc++) returns the totally wrong result 9, -2, 4 and the second one (boost's implementation) returns 5, 1, 3 as expected.

If you compile with clang + libc++, your code works fine.

(Note that libstdc++'s regex implementation is only "partially supported", as described in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52719.)
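
For readers who want to check their own toolchain, here is a minimal sketch, assuming a standard library with a conforming std::regex. The helper names `after_last_gt_regex` and `after_last_gt_plain` are hypothetical and not part of the question. The first applies the question's pattern; the second is a regex-free fallback for single-line input that relies only on `std::string::rfind`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the question's pattern and returns the
// text after the last '>', or an empty string when nothing matches.
std::string after_last_gt_regex(const std::string& input) {
    static const std::regex pattern(".*>(.*)");
    std::smatch m;
    // `input` is a named object here, so the iterators stored in `m`
    // remain valid while we read the submatch.
    if (std::regex_match(input, m, pattern)) {
        return m[1].str();
    }
    return std::string();
}

// Hypothetical regex-free fallback: finds the last '>' directly.
std::string after_last_gt_plain(const std::string& input) {
    const std::string::size_type pos = input.rfind('>');
    if (pos == std::string::npos) {
        return std::string();
    }
    return input.substr(pos + 1);
}
```

If the two helpers disagree on a single-line input such as the question's string, that would point at the library's regex engine rather than at the pattern.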

kennytm
  • Oh my, that’s singularly annoying. Any chance of choosing another syntax option? Not that I’d *want* something other than ECMA-Script … but if that doesn’t work … (incidentally, I’ve now started wondering why they didn’t go with PCRE). – Konrad Rudolph Jun 05 '12 at 08:24
  • By the way, the bug still exists in GCC 4.7. – Konrad Rudolph Jun 05 '12 at 08:28
  • thank you for the examples and explanations. I guess it's not fair of me to expect much if it is only partially supported. I'll either use boost or avoid regexes for the time being. – Xu Wang Jun 05 '12 at 08:38
  • @KonradRudolph: It's not related to ECMAScript. `regex my_r("(.*)(6)(.*)", regex::extended)` still has the same bug. – kennytm Jun 05 '12 at 09:59
  • Ah, rats. I thought the *engines* were pluggable but it looks like it’s only the parser. – Konrad Rudolph Jun 05 '12 at 12:09

You can modify your regular expression so that matched parts are divided into groups:

std::regex my_r("(.*)>(.*)\\).*"); // group1>group2).*
std::string temp("~(cols=\"64\">Hut)~");
std::sregex_iterator reg_it(temp.begin(), temp.end(), my_r);

if (reg_it != std::sregex_iterator() && reg_it->size() > 1) { // only dereference when there is a match
    std::cout
        << "1: " << reg_it->str(1) << std::endl  // group1 match
        << "2: " << reg_it->str(2) << std::endl; // group2 match
}

outputs:

1: ~(cols="64"
2: Hut

Note that groups are delimited by parentheses ( /* your regex here */ ). If you want a literal parenthesis to be part of your expression, you need to escape it with \, which is written \\ in a C++ string literal. For more information see Grouping Constructs.
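
As a small illustration of the escaping rule, the sketch below spells the same pattern twice: once with a doubled backslash and once as a raw string literal, where the backslash is written only once. The names `escaped_pattern`, `raw_pattern` and `text_before_paren` are hypothetical, and the `re` delimiter in the raw literal is an arbitrary choice.

```cpp
#include <regex>
#include <string>

// The same pattern spelled two ways; both hold identical characters.
const std::string escaped_pattern = "(.*)>(.*)\\).*";
const std::string raw_pattern = R"re((.*)>(.*)\).*)re";  // "re" is an arbitrary delimiter

// Hypothetical helper: returns group 2, the text between the last '>'
// and the last literal ')', or an empty string when nothing matches.
std::string text_before_paren(const std::string& input) {
    const std::regex pattern(raw_pattern);
    std::smatch m;
    if (std::regex_match(input, m, pattern)) {
        return m[2].str();
    }
    return std::string();
}
```

A raw literal tends to be easier to read once a pattern contains several escapes, at the cost of picking a delimiter that does not occur in the pattern itself.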

This question can also help you: How do I loop through results from std::regex_search?

Also, don't put using namespace std; at the top of your files; it's bad practice.
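
For the looping case mentioned above, a minimal sketch might look like the following. `collect_group` is a hypothetical helper, not something from the linked question. It walks every match with `std::sregex_iterator` and compares against a default-constructed iterator, which marks the end of the sequence, before dereferencing.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects capture group `group` from every match
// of `pattern` in `text`. Matches without that group are skipped.
std::vector<std::string> collect_group(const std::string& text,
                                       const std::regex& pattern,
                                       std::size_t group) {
    std::vector<std::string> out;
    const std::sregex_iterator end;  // default-constructed: end of sequence
    for (std::sregex_iterator it(text.begin(), text.end(), pattern);
         it != end; ++it) {
        if (group < it->size()) {
            out.push_back(it->str(group));
        }
    }
    return out;
}
```

With an assumed attribute pattern such as `R"re((\w+)="(\d+)")re"`, calling `collect_group(text, pattern, 1)` would gather the attribute names and group 2 the numeric values.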

LihO
  • Thank you for your answer and for your tip regarding `using namespace std;`. I appreciate the explanations! – Xu Wang Jun 05 '12 at 08:37