CPP + Regular Expression to Validate URL

Question

I want to build a regular expression in C++ (MFC) that validates a URL.

The regular expression must satisfy the following conditions.

Valid URLs:

http://cu-241.dell-tech.co.in/MyWebSite/ISAPIWEBSITE/Denypage.aspx/
http://www.google.com
http://www.google.co.in

Invalid URLs:

http://cu-241.dell-tech.co.in/\MyWebSite/\ISAPIWEBSITE/\Denypage.aspx/ = The regex must reject this URL because of the '\' characters in "/\MyWebSite/\ISAPIWEBSITE/\Denypage.aspx/".
http://cu-241.dell-tech.co.in//////MyWebSite/ISAPIWEBSITE/Denypage.aspx/ = The regex must reject this URL because of the repeated slashes ("//////").
http://news.google.co.in/%5Cnwshp?hl=en&tab=wn = The regex must reject this URL because it contains an inserted %5C or %2F character.

How can we write a generic regular expression that satisfies the above conditions? Please help by providing a regular expression that handles these scenarios in C++ (MFC).

score 13 · Answer 1 · edited Oct 07 '21 at 07:59

Have you tried the regular expression suggested in RFC 3986 (Appendix B)? If you can use GCC 4.9 or later, you can use <regex> directly.

The RFC states that matching a URI against ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? yields the following submatches:

  scheme    = $2
  authority = $4
  path      = $5
  query     = $7
  fragment  = $9

For example:

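#include <cstdlib>
#include <iostream>
#include <regex>
#include <string>

int main(int argc, char *argv[])
{
  // Expect the URL to split as the first command-line argument.
  if (argc < 2) {
    std::cerr << "Usage: url-matcher <url>" << std::endl;
    return EXIT_FAILURE;
  }

  std::string url (argv[1]);
  unsigned counter = 0;

  // RFC 3986, Appendix B. In the POSIX extended grammar a backslash inside
  // a bracket expression is a literal character, so '/' is left unescaped.
  std::regex url_regex (
    R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)",
    std::regex::extended
  );
  std::smatch url_match_result;

  std::cout << "Checking: " << url << std::endl;

  if (std::regex_match(url, url_match_result, url_regex)) {
    // Index 0 is the whole match; indices 1 to 9 are the capture groups.
    for (const auto& res : url_match_result) {
      std::cout << counter++ << ": " << res << std::endl;
    }
  } else {
    std::cerr << "Malformed url." << std::endl;
  }

  return EXIT_SUCCESS;
}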

Then:

./url-matcher http://localhost.com/path\?hue\=br\#cool

Checking: http://localhost.com/path?hue=br#cool
0: http://localhost.com/path?hue=br#cool
1: http:
2: http
3: //localhost.com
4: localhost.com
5: /path
6: ?hue=br
7: hue=br
8: #cool
9: cool

answered Jul 24 '15 at 14:33

Ciro Costa

2,455
22
25

This is working really great. Could you show me how to use this to extract all matching urls in a string using the regex? I tried to use it with sregex_iterator but I don't get any matches. Thank you very much! – Julius Feb 28 '16 at 10:31
Unfortunately this is not for validating but for splitting a correct URI into its parts. It will not even detect the simplest cases, like unencoded spaces. – Lothar Jul 27 '16 at 03:27
Thanks for such a useful and well-explained answer. This is the best all-around URL parsing script I've found for accuracy, ease of use, and quickness of implementation. And you don't need to download any special libraries! It would make a great answer to this question: https://stackoverflow.com/q/2616011/1043704 – Lorien Brune Sep 30 '17 at 03:19
Note that if a port is present, it will be included in elements 3 and 4. I.e., `http://localhost.com:8888/path?hue=br#cool` results in `3: //localhost.com:8888` and `4: localhost.com:8888`. – Lorien Brune Sep 30 '17 at 03:49
This does not work: "aaa" is accepted by this expression. – Andrey Nekrasov Dec 02 '20 at 12:52

score 0 · Answer 2 · answered Apr 11 '11 at 11:14

Look at http://gskinner.com/RegExr/. There is a community tab on the right where you can find contributed regexes, including a URI category. I'm not sure you'll find exactly what you need, but it is a good start.

answered Apr 11 '11 at 11:14

davka

Daniel · Answer 3 · 2023-02-20T17:36:36.130

With the following regex you can filter out most incorrect URLs:

#include <Poco/Exception.h>
#include <regex>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        throw Poco::InvalidArgumentException("Expected a URL as the first argument.");
    }
    std::string url(argv[1]);
    // Scheme, lowercase host, optional port without a leading zero, optional path.
    std::regex urlRegex(R"(^https?://[0-9a-z\.-]+(:[1-9][0-9]*)?(/[^\s]*)*$)");

    if (!std::regex_match(url, urlRegex)) {
        throw Poco::InvalidArgumentException(
            "Malformed URL: \"" + url + "\". "
            "The URL must start with http:// or https://, "
            "the domain name should only contain lowercase alphanumeric characters, '.' and '-', "
            "the port should not start with 0, "
            "and the URL should not contain any whitespace.");
    }
}

It checks that the URL starts with http:// or https://, that the domain name contains only lowercase alphanumeric characters, '.' and '-', and that the port, if present, does not start with 0 (e.g. 0123). Beyond that it allows any port number and any path or query string that does not contain whitespace.

But to be absolutely sure that the URL is valid, you are probably better off parsing it. I would not recommend trying to cover every scenario with a regex (including the correctness of paths, queries and fragments), because that would be quite difficult.
