
I want to extract all URLs from a string. I found what looks like the perfect regex in this thread.

Now I need help iterating over all the matches. I also took a look at this example (at the bottom), but I just can't get it to work the way I want.

Basically, I want to iterate over all the matches as in the second example, and I also want to access the submatches (5 and 8) as in the first example.

Currently I only get the first match. How can I get the rest?

#include <iostream>
#include <regex>
#include <string>

int main() {
    unsigned counter = 0;
    std::string urls = "www.google.de/test.php&id=2#anker stackoverflow www.test.com please work example.com/test";
    std::regex word_regex(
        R"(^(([^:\/?#]+):)?(//([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)",
        std::regex::extended);
    auto words_begin = std::sregex_iterator(urls.begin(), urls.end(), word_regex);
    auto words_end = std::sregex_iterator();

    for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
        std::smatch match = *i;
        std::string match_str = match.str();
        for (const auto& res : match) {
            std::cout << counter++ << ": " << res << std::endl;
        }
        std::cout << "  " << match_str << '\n';
    }
}

Output:

0: www.google.de/test.php&id=2#anker stackoverflow www.test.com please work example.com/test
1: 
2: 
3: 
4: 
5: www.google.de/test.php&id=2
6: 
7: 
8: #anker stackoverflow www.test.com please work example.com/test
9: anker stackoverflow www.test.com please work example.com/test
www.google.de/test.php&id=2#anker stackoverflow www.test.com please work example.com/test
  • Can you post the code you wrote so we can try to see where you are going wrong? – Galik Mar 24 '16 at 12:26
  • @Galik I added some code. I didn't add code in the first place because I am pretty sure this code is completely wrong. Thank you :) – Julius Mar 24 '16 at 12:49
  • I think that *regex* is designed to validate/extract parts of a `URL` that has already been extracted from a document. It looks like it will only work on one `URL` at a time. – Galik Mar 24 '16 at 12:56
  • What sort of document do the URLs come from? Is it `HTML`? – Galik Mar 24 '16 at 13:03
  • Yes, I'm building something like a crawler. Do you have a hint for me as to what I have to change? Thank you! – Julius Mar 24 '16 at 13:11
    What I would do is make 2 regexes. One to extract the `URLs` from the `HTML` eg: `std::regex e_url(R"~(href=["']([^"']*)["'])~");` and a second one to validate/extract its component parts. – Galik Mar 24 '16 at 13:21
  • @Galik Thank you, works great now :) – Julius Mar 24 '16 at 15:19

0 Answers