
I have some HTML code like this:

<tr class="class1">
    <td class="class2">
        <a href="some_address"></a>
        <div id="id1">
            <span class="class3"></span>
        </div>
        <span>Just a text</span>
    </td>
</tr>

I need to extract the code between the <tr class="class1"> and </tr> tags. I use this regular expression: https://regex101.com/r/Z0Pmgg/1 . It seems to work there, but when I try to use the same expression with the C++ standard library <regex>, it doesn't match at all:

#include <string>
#include <regex>
#include <iostream>

int main()
{
    std::string str = "<tr class=\"class1\">\n"
                          "<td class=\"class2\">\n"
                              "<a href=\"some_address\"></a>\n"
                              "<div id=\"id1\">\n"
                                  "<span class=\"class3\"></span>\n"
                              "</div>\n"
                              "<span>Just a text</span>\n"
                          "</td>\n"
                      "</tr>\n";
    std::cmatch result;
    std::regex regular("(<tr class=\"class1\">)"
                       "([\s\S]*?)"
                       "(<\/tr>)");
    if (std::regex_match(str.c_str(), result, regular))
        std::cout << "Success\n" << result[2] << std::endl;
    return 0;
}

What am I doing wrong? I also tried to use regex_search() instead.

scohe001
elo2cx
  • `[\s\S]*?` => `[\\s\\S]*?` – Wiktor Stribiżew Aug 06 '20 at 20:10
  • What compiler and version? Early GCC regex implementations, for example, were more than a little wonky, and GCC 4.8 keeps showing up like a zombie rising from the dead to plague the unsuspecting. – user4581301 Aug 06 '20 at 20:13
  • Doesn't anyone read documentation? regex_match is wrong, regex_search is correct (for what you want to do). – john Aug 06 '20 at 20:18
  • Probably worth reading: https://stackoverflow.com/a/1732454/2602718 – scohe001 Aug 06 '20 at 20:25
  • I'm not certain, but I don't believe the construct `[\s\S]` is supported in C++, even when properly escaped – john Aug 06 '20 at 20:26
  • But obviously what you need to do is treat this like any other debugging task. Break the problem down into simpler and simpler pieces, until you find that part that isn't working. – john Aug 06 '20 at 20:28
  • @scohe001 I agree, do not use regex for HTML parsing. Get an HTML parser. – PaulMcKenzie Aug 06 '20 at 20:30
  • @john Yes, it should be: https://en.cppreference.com/w/cpp/regex/ecmascript see "Character classes" – Slava Aug 06 '20 at 21:22
  • @Slava I looked at that. The `[]` construct is a *character class* containing zero or more *class ranges*, while the `\s` construct is an *atom*. There doesn't seem to be any production that gets from *class range* to *atom*. – john Aug 07 '20 at 05:18
  • @john "The character class escapes are shorthands for some of the common characters classes, as follows:". So `\s` is a shorthand for `[[:space:]]` and `\S` is for `[^[:space:]]` – Slava Aug 07 '20 at 05:46
  • @Slava Yes I know, but we are talking about embedding a character class `\s` inside another character class `[]`. I don't see that the grammar allows that. – john Aug 07 '20 at 05:48
  • @john I see, sorry, I missed that. Then what is `[\s\S]` supposed to mean, either space or not space? So it is a synonym for `.`? But why? – Slava Aug 07 '20 at 05:51
  • @Slava In many regex languages `.` does not include a newline. So I guess that is the explanation. I think it's clear the OP has copied this regex from somewhere else and tried to apply it to C++. The incorrect escaping seems to indicate that; `<\/tr>` would be typical in Perl, for instance. – john Aug 07 '20 at 06:09
  • Thank you guys! I am not good with regular expressions yet. Your advice will be useful to me. I tried to run my code in VS2019, with MSVC I think, but I don't know which version. – elo2cx Aug 07 '20 at 14:56

1 Answer

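You need to escape each \ (inside an ordinary string literal, write \\s rather than \s) and take the final \n into account, because regex_match succeeds only when the pattern matches the entire string. Better yet, use regex_iterator instead of regex_match.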

The following works for me in GCC 8, Clang 8 and MSVC 14:

#include <string>
#include <regex>
#include <iostream>

int main()
{
    std::string str = "<tr class=\"class1\">\n"
        "<td class=\"class2\">\n"
        "<a href=\"some_address\"></a>\n"
        "<div id=\"id1\">\n"
        "<span class=\"class3\"></span>\n"
        "</div>\n"
        "<span>Just a text</span>\n"
        "</td>\n"
        "</tr>\n";
    std::regex re("(<tr class=\"class1\">\\s*)"
        "([\\s\\S]*?)"
        "(\\s*</tr>\\s*)");

    for (std::sregex_iterator it{ str.begin(), str.end(), re }, end{}; it != end; it++) {
        std::smatch result = *it;
        std::cout << "Found:\n\n" << result[2] << "\n";
    }
}

Output:

Found:

<td class="class2">
<a href="some_address"></a>
<div id="id1">
<span class="class3"></span>
</div>
<span>Just a text</span>
</td>
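
The regex_match versus regex_search distinction raised in the comments can be sketched in isolation. The two helper names below are hypothetical and exist only for illustration; the pattern is a shortened stand-in for the one above, and the sketch assumes a standard library whose <regex> accepts [\s\S].

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true only if the pattern covers ALL of s.
bool whole_string_matches(const std::string& s, const std::regex& re)
{
    return std::regex_match(s, re);
}

// Hypothetical helper: true if the pattern matches ANY substring of s.
bool contains_match(const std::string& s, const std::regex& re)
{
    return std::regex_search(s, re);
}
```

With the pattern "<tr>[\\s\\S]*?</tr>", the input "<tr>x</tr>\n" should pass contains_match but fail whole_string_matches, because nothing in the pattern consumes the trailing newline. That is the same reason the pattern above ends in \\s*.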

Note: some old libstdc++ and libc++ implementations had difficulty with character class escapes such as \s inside a bracket expression [...]. In that case, try replacing [\\s\\S] with (?:\\s|\\S) (or better yet, upgrade your libstdc++ to 6-4.9.1 or later).
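
As a variation on the same idea, here is a minimal sketch that uses regex_search together with a raw string literal, so the backslashes do not have to be doubled. The helper name extract_row_body is an assumption made for illustration, not part of the code above, and the sketch assumes C++17 (for std::optional) and a standard library that handles [\s\S].

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper (not from the original answer): returns the text
// between <tr class="class1"> and the first following </tr>, if present.
std::optional<std::string> extract_row_body(const std::string& html)
{
    // Raw string literal: backslashes reach the regex engine unchanged,
    // so \s and \S are written once instead of as \\s and \\S.
    static const std::regex re(
        R"re(<tr class="class1">\s*([\s\S]*?)\s*</tr>)re");

    std::smatch m;
    // regex_search looks for the pattern anywhere in the input, so
    // leading or trailing text (such as a final newline) is allowed.
    if (std::regex_search(html, m, re))
        return m[1].str();
    return std::nullopt;
}
```

Calling extract_row_body(str) on the string from the example above should yield the same inner block as the iterator version. As the comments point out, this remains a regex shortcut and not a real HTML parser.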

rustyx