C++ RegEx out of memory

Question

I am using regex to retrieve a string from between divs in an HTML page, but I have run into an out-of-memory error. I am using Visual Studio 2012 and C++.

The regex is "class=\"ListingDescription\">((.*|\r|\n)*?(?=</div>))" and RegexBuddy reckons it matches in 242 steps (much better than the ~5000 it originally took). The website I am trying to scrape the info from is http://www.trademe.co.nz/Browse/Listing.aspx?id=557211466

Here is the code:

typedef match_results<const char*> cmatch;
tr1::cmatch results;
try {
    tr1::regex regx("class=\"ListingDescription\">((.*|\\r|\\n)*?(?=</div>))");

    tr1::regex_search(data.c_str(), results, regx);

    // Capture group 1 holds the text between the opening tag and </div>.
    cout << results[1];
}
catch (const std::regex_error& e) {
    std::cout << "regex_error caught: " << e.what() << '\n';
    if (e.code() == std::regex_constants::error_brack) {
        std::cout << "The code was error_brack\n";
    }
}

This is the error I get:

regex_error caught: regex_error(error_stack): There was insufficient memory to determine whether the regular expression could match the specified character sequence.

RegexBuddy works fine, and so do some online regex tools, just not my code :( Please help.

If you're using VS2012 why not use the regular `` instead of `` — Rapptz, Feb 10 '13 at 11:29
Probably not useful towards your problem, but is `(.*|\\r|\\n)` any different than `.*`? — mah, Feb 10 '13 at 11:48
@mah Yes, `.` matches all characters except newline characters. — Clement Bellot, Feb 10 '13 at 12:01
Frequently, when grabbing info from inside html it will be easier using a DOM parser. — ninMonkey, Feb 11 '13 at 04:13

score 2 · Accepted Answer · edited May 23 '17 at 12:18

You are using `.` in a place where it can repeat, so it will match every `<`, including the one that begins `</div>`, which is probably not what you want.

And now the mandatory link: "RegEx match open tags except XHTML self-contained tags".

Using regexp to parse HTML is generally a bad idea. You should use an HTML parser instead.
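
To illustrate the point about `.`: one way to keep the repeated part from running over a `<` is a negated character class instead of a repeated group. This is only a minimal sketch, assuming the description contains no nested tags. `extract_description` is a hypothetical helper name, not something from the question, and an HTML parser is still the more robust choice.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the question): returns the text between
// class="ListingDescription"> and the following </div>, or an empty
// string when the pattern does not match.
std::string extract_description(const std::string& html) {
    // [^<]* matches any character except '<' (newlines included), so the
    // capture cannot run past a tag and no nested quantifier is needed.
    static const std::regex re("class=\"ListingDescription\">([^<]*)</div>");
    std::smatch m;
    if (std::regex_search(html, m, re)) {
        return m[1].str();
    }
    return std::string();
}
```

If the description itself contains markup such as `<br>`, this pattern does not match and the helper returns an empty string, which is one more reason to prefer a parser.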

Clement Bellot · answered Feb 10 '13 at 11:32

I see now. Regex is pretty limited in some areas. I will have a look at parsers and try them out. What I have done in the meantime is: `std::string startstr = "<div id=\"ListingDescription_ListingDescription\" class=\"ListingDescription\">"; unsigned startpos = data.find(startstr) + strlen(startstr.c_str()); unsigned endpos = data.find("</div>", startpos); std::string desc = data.substr (startpos,endpos - startpos);` – user2058629 Feb 11 '13 at 03:15

user2058629 · Answer 2 · 2013-02-11T03:32:59.880

I see now. Regex is pretty limited in some areas. I will have a look at parsers and try them out. What I have done in the meantime is:

std::string startstr = "<div id=\"ListingDescription_ListingDescription\" class=\"ListingDescription\">";
// Start just past the opening tag (this assumes the tag is present in data).
std::string::size_type startpos = data.find(startstr) + startstr.size();
std::string::size_type endpos = data.find("</div>", startpos);
std::string desc = data.substr(startpos, endpos - startpos);

LOL, I know it's not great, but it works.

Thanks, Clement Bellot.
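
For anyone reusing the find/substr approach, here is a hedged sketch of the same idea with the "not found" cases handled, since `find` returns `std::string::npos` when a marker is missing. `extract_between` and its parameter names are made up for illustration and are not part of the code above.

```cpp
#include <string>

// Hypothetical helper: copies the text between the first occurrence of
// `open` and the next occurrence of `close` after it into `out`.
// Returns false (leaving `out` untouched) if either marker is missing.
bool extract_between(const std::string& data,
                     const std::string& open,
                     const std::string& close,
                     std::string& out) {
    const std::string::size_type start = data.find(open);
    if (start == std::string::npos) {
        return false;  // opening marker not present
    }
    const std::string::size_type begin = start + open.size();
    const std::string::size_type end = data.find(close, begin);
    if (end == std::string::npos) {
        return false;  // no closing marker after the opening one
    }
    out = data.substr(begin, end - begin);
    return true;
}
```

It could be called as `extract_between(data, startstr, "</div>", desc)`, checking the return value before using `desc`. Like the original, it stops at the first `</div>`, so nested divs inside the description would still cut the result short.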
