I can extract the four-line string with this fragment using C++ std::regex:

  std::regex table("(<table id.*\n.*\n.*\n.*>)");
  const std::string format="$&";
  std::cout <<
     std::regex_replace(tidy_string(/* */)
        ,table
        ,format
        ,std::regex_constants::format_no_copy
        |std::regex_constants::format_first_only
        )
     << '\n';

tidy_string() returns a std::string, and the code produces this output:

<table id="creditPolicyTable" class=
                              "table table-striped table-condensed datatable top-bold-border bottom-border"
                              summary=
                              "This table of Credit Policy gives credit information (column headings) for list of exams (row headings).">

How do I match on text that has a varying number of lines rather than exactly four? For example:

<table id="creditPolicyTable" summary=
                              "This table of Credit Policy gives credit information (column headings) for list of exams (row headings).">

or:

<table id="creditPolicyTable"
    class="table table-striped table-condensed datatable top-bold-border bottom-border"
   summary="This table of Credit Policy gives credit information (column headings) for list of exams (row headings)."
 more="x"
 even_more="y">
CW Holeman II
  • You could possibly just use `(<table[^>]*?>)`. This would match everything until the first `>` and therefore give you the content of your `<table>` tag (assuming there are no escaped `>` characters inside). In general, I think using regex to parse XML/HTML is not the best approach; have you considered using an XML parser instead (e.g. libxml2)?
    – ThePhysicist Aug 21 '17 at 11:45
  • Those later tags, do you mean to write something like "
    "?
    – AndyG Aug 21 '17 at 11:48
  • BTW the `.*` operators that you use above are "greedy", i.e. they try to match as many characters as possible. This could be a problem if you had a very long file with many "<table>" tags inside.
    – ThePhysicist Aug 21 '17 at 11:53
  • I feel obliged to link to this great SO answer, and hope you find an alternate method of parsing XML data. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Haleemur Ali Aug 21 '17 at 13:41

1 Answer

You should use std::regex_search and lazily match anything but the '>' character. Because the negated class [^>] also matches newlines, the tag can span any number of lines. Like this:

#include <iostream>
#include <regex>
#include <string>

int main() {
  // Two sample tags spanning a different number of lines (raw string
  // literals keep the embedded newlines).
  const std::string lines[] = {
      R"(<table id="creditPolicyTable" class=
"table table-striped table-condensed datatable top-bold-border bottom-border"
summary=
"This table of Credit Policy gives credit information (column headings) for list of exams (row headings).">)",
      R"(<table id="creditPolicyTable" summary=
"This table of Credit Policy gives credit information (column headings) for list of exams (row headings)."
more="x"
even_more="y">)"};
  std::smatch table_match;

  // [^>] matches any character except '>', including '\n'.
  const std::regex table_regex("<table\\sid=[^>]+?>");

  for (const auto& line : lines) {
    if (std::regex_search(line, table_match, table_regex)) {
      // The pattern has no capture groups, so only table_match[0] exists.
      for (size_t i = 0; i < table_match.size(); ++i)
        std::cout << "Match found " << table_match[i] << '\n';
    }
  }
}
Marc Lambrichs