-1

I have a test string:

<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>

I want to find <a href="4.%20Functions,%20scope.ppt"> (as a substring)

From searching Google, I found that I can use regex e("<a href=.*?>"); and cmatch cm; to capture the substring I want to find.

What should I do next?

Is it correct to use regex_match(htmlString, cm, e); with htmlString as a wchar_t*?

kgf3JfUtW

2 Answers

2

If you want to find all the matching substrings then you need to use the regex iterators:

#include <iostream>
#include <regex>
#include <string>

// example data
std::wstring const html = LR"(

<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>
<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>
<td><a href="4.%20Functions,%20scope.ppt">4. Functions, scope.ppt</a></td>

)";

// for convenience
constexpr auto fast_n_loose = std::regex_constants::optimize|std::regex_constants::icase;

// extract href's
std::wregex const e_link{LR"~(href=(["'])(.*?)\1)~", fast_n_loose};

int main()
{
    // regex iterators       
    std::wsregex_iterator itr_end;
    std::wsregex_iterator itr{std::begin(html), std::end(html), e_link};

    // iterate through the matches
    for(; itr != itr_end; ++itr)
    {
        std::wcout << itr->str(2) << L'\n';
    }
}
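If only the first match is needed, and the data really is a `wchar_t*` as in the question, a minimal sketch could look like the following. It uses `std::regex_search` rather than `std::regex_match`, because `regex_match` succeeds only when the whole string matches the pattern, and it uses `std::wregex` with `std::wcmatch`, since `std::regex` and `std::cmatch` are the narrow `char` versions. The helper name `first_anchor_tag` is hypothetical, and the pattern is the one from the question.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first "<a href=...>" opening tag found
// in html, or an empty string if there is none.
std::wstring first_anchor_tag(const wchar_t* html)
{
    // Pattern from the question, compiled as a wide-character regex.
    static const std::wregex e{LR"~(<a href=.*?>)~"};

    // wcmatch is the match_results type for const wchar_t* input.
    std::wcmatch cm;

    // regex_search looks for the pattern anywhere in the input;
    // regex_match would require the entire input to match.
    if (std::regex_search(html, cm, e))
        return cm.str();

    return std::wstring{};
}
```

Calling `first_anchor_tag` on the test string from the question should return only the opening `<a href="...">` tag.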
Galik
  • This seems to be exactly what the OP is looking for if I understand the question +1. – Justin Randall Dec 21 '17 at 17:13
  • Can I use `string` instead of `wstring`? – BUI CHAU Minh Tung Dec 21 '17 at 20:05
  • @BUICHAUMinhTung Your question mentions `wchar_t*` for the data so you really need to use `std::wstring` for that. But yes you can do all this with `std::string`, `std::regex` and `std::sregex_iterator` if you don't need to process multibyte characters. – Galik Dec 21 '17 at 20:57
  • @BUICHAUMinhTung If your source data is `UTF-8` in a `std::string`, then you will need to convert it into wide-character Unicode, as in this answer: https://stackoverflow.com/questions/37989081/how-to-use-unicode-range-in-c-regex/37990517#37990517 Example conversion functions can be found in this answer: https://stackoverflow.com/questions/43302279/any-good-solutions-for-c-string-code-point-and-code-unit/43302460#43302460 – Galik Dec 21 '17 at 21:07
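As a rough illustration of the narrow-string variant mentioned in the comments, the same approach can be sketched with `std::string`, `std::regex` and `std::sregex_iterator`. The function name `extract_hrefs` is hypothetical, and the sketch assumes the input needs no multibyte-aware processing.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the value of every href attribute in html.
std::vector<std::string> extract_hrefs(const std::string& html)
{
    // Same pattern and flags as the wide-character answer, but narrow.
    static const std::regex e_link{
        R"~(href=(["'])(.*?)\1)~",
        std::regex_constants::optimize | std::regex_constants::icase};

    std::vector<std::string> hrefs;

    // A default-constructed iterator marks the end of the match sequence.
    for (std::sregex_iterator itr{html.begin(), html.end(), e_link}, itr_end;
         itr != itr_end; ++itr)
    {
        hrefs.push_back(itr->str(2)); // group 2 is the text between the quotes
    }
    return hrefs;
}
```

Returning the matches in a `std::vector` instead of printing them is only a design choice for this sketch; it makes the result easier to reuse.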
1

This will match the complete a tag and also get the href attribute value,
which is in capture group 2.

It should be done this way because the href attribute can be anywhere in the tag.

<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])([\S\s]*?)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

You can substitute [\w:]+ in place of the a tag to get the href from all tags.

https://regex101.com/r/LHZXUM/1

Formatted and tested

 < a                    # a tag, substitute [\w:]+ for any tag

 (?=                    # Assertion (a pseudo atomic group)
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s href \s* = \s* 
      (?:
           ( ['"] )               # (1), Quote
           ( [\S\s]*? )           # (2), href value
           \1 
      )
 )
 \s+ 
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >
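
For use from C++, a hedged sketch of applying the compact form of this pattern with `std::regex` follows. It assumes the default ECMAScript grammar of `std::regex`, which supports lookahead, lazy quantifiers and back-references but has no free-spacing mode, so the formatted version above cannot be passed as-is. The function name `anchor_href` is hypothetical, and the behaviour was reasoned through for short inputs only, not benchmarked.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the href value of the first <a ...> tag that
// has one, or an empty string if no such tag is found.
std::string anchor_href(const std::string& html)
{
    // Compact form of the pattern above; group 1 is the quote character,
    // group 2 is the href value captured inside the lookahead.
    static const std::regex e_tag{
        R"~(<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])([\S\s]*?)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>)~"};

    std::smatch m;
    if (std::regex_search(html, m, e_tag))
        return m.str(2);

    return std::string{};
}
```

Because the lookahead scans the whole tag, the href is found even when other attributes precede it, for example `<a class="x" href="...">`.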