
I want to extract all URLs from a string. I found what looks like the perfect regex in this thread.

Now I need help iterating over all the matches. I also took a look at this example (at the bottom), but I just can't get it to work the way I want.

Basically, I want to iterate over all the matches as in the second example, and I also want to access the submatches (5 and 8) as in the first example.

Currently I only get the first match. How can I get the rest?

#include <iostream>
#include <regex>
#include <string>

int main() {
    unsigned counter = 0;
    std::string urls = "www.google.de/test.php&id=2#anker stackoverflow www.test.com please work example.com/test";
    std::regex word_regex(
        R"(^(([^:\/?#]+):)?(//([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)",
        std::regex::extended);
    auto words_begin = std::sregex_iterator(urls.begin(), urls.end(), word_regex);
    auto words_end = std::sregex_iterator();

    for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
        std::smatch match = *i;
        std::string match_str = match.str();
        for (const auto& res : match) {
            std::cout << counter++ << ": " << res << std::endl;
        }
        std::cout << "  " << match_str << '\n';
    }
}

Output:

0: www.google.de/test.php&id=2#anker stackoverflow www.test.com please work example.com/test
1: 
2: 
3: 
4: 
5: www.google.de/test.php&id=2
6: 
7: 
8: #anker stackoverflow www.test.com please work example.com/test
9: anker stackoverflow www.test.com please work example.com/test
www.google.de/test.php&id=2#anker stackoverflow www.test.com please work example.com/test
  • Can you post the code you wrote so we can try to see where you are going wrong? – Galik Mar 24 '16 at 12:26
  • @Galik I added some code. I didn't add code in the first place because I am pretty sure this code is completely wrong. Thank you :) – Julius Mar 24 '16 at 12:49
  • I think that *regex* is designed to validate/extract parts of a `URL` that has already been extracted from a document. It looks like it will only work on one `URL` at a time. – Galik Mar 24 '16 at 12:56
  • What sort of document do the URLs come from? Is it `HTML`? – Galik Mar 24 '16 at 13:03
  • Yes, I'm building something like a crawler. Do you have a hint for me as to what I have to change? Thank you! – Julius Mar 24 '16 at 13:11
    What I would do is make 2 regexes. One to extract the `URLs` from the `HTML` eg: `std::regex e_url(R"~(href=["']([^"']*)["'])~");` and a second one to validate/extract its component parts. – Galik Mar 24 '16 at 13:21
  • @Galik Thank you, works great now :) – Julius Mar 24 '16 at 15:19

0 Answers