
I found this page: https://mathiasbynens.be/demo/url-regex, which lists various regular expressions for URL validation along with the test cases each one passes. Diego Perini's regex is the most powerful one, and I would like to use it in Java. However, it doesn't work when I use it like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLValidation {
    // "\" replaced by "\\"
    private static Pattern REGEX = Pattern.compile("_^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!10(?:\\.\\d{1,3}){3})(?!127(?:\\.\\d{1,3}){3})(?!169\\.254(?:\\.\\d{1,3}){2})(?!192\\.168(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)*(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}]{2,})))(?::\\d{2,5})?(?:/[^\\s]*)?$_iuS");

    private static String[] URLs = new String[] { "http://foo.com/blah_blah", "http://foo.com/blah_blah/", "http://foo.com/blah_blah_(wikipedia)", "http://foo.bar?q=Spaces should be encoded" };

    public static void main(String[] args) throws Exception {
        for (String url : URLs) {
            Matcher matcher = REGEX.matcher(url);
            if (matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }
}

This code outputs nothing, but it should output the first three URLs in the array. How do I compile the regex properly to get the code working?

upd: Thanks for the proposals. I tested your regexes in the real application, which iterates through log files and looks for URLs in each line. The log files have timestamps and usernames enclosed in [] and <> respectively, and sometimes contain invisible special characters responsible for formatting (color, boldness, etc.), like \u0003. The regex seems to have a problem with that type of string: http://ideone.com/WEcgBY

upd2: And how about a regex that finds all URLs in a line if it contains several? For example, to use it like this:

String[] urlsFromLine = REGEX.split(line);
for (String url : urlsFromLine) {
    System.out.println(url);
}
Danny Lo

1 Answer


Use this version:

"(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$"

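You did not have to double the slashes, add regex delimiters, put modifiers at the end of the pattern, or turn \u into \x notation. Java's Pattern.compile takes the bare pattern, so the `_` delimiters and the trailing `iuS` are treated as literal pattern text; in particular, `_^` requires an underscore before the start of the input, which can never match. Case-insensitive matching is enabled with the inline `(?i)` flag instead.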
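
To illustrate the point about delimiters and modifiers, here is a minimal sketch using a deliberately shortened pattern (an assumption for readability, not the full regex from this answer). The class and field names are hypothetical. It compares a PCRE-style literal pasted as-is with the two Java ways of requesting case-insensitive matching:

```java
import java.util.regex.Pattern;

class DelimiterDemo {
    // PCRE-style literal pasted unchanged: the '_' delimiters and the
    // trailing "i" modifier become ordinary pattern characters in Java.
    static final Pattern PCRE_STYLE = Pattern.compile("_^https?://\\S+$_i");

    // Java style, option 1: no delimiters, flag given inline.
    static final Pattern INLINE_FLAG = Pattern.compile("(?i)^https?://\\S+$");

    // Java style, option 2: flag passed as a second argument.
    static final Pattern FLAG_ARG =
        Pattern.compile("^https?://\\S+$", Pattern.CASE_INSENSITIVE);

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String url = "HTTP://foo.com/blah_blah";
        System.out.println(found(PCRE_STYLE, url));  // false
        System.out.println(found(INLINE_FLAG, url)); // true
        System.out.println(found(FLAG_ARG, url));    // true
    }
}
```

The inline flag is handy when the pattern is stored as a plain string (for example in a config file); the flag argument keeps the pattern text itself unchanged.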

See IDEONE demo:

String[] URLs = new String[] { "http://foo.com/blah_blah", "http://foo.com/blah_blah/", "http://foo.com/blah_blah_(wikipedia)", "http://foo.bar?q=Spaces should be encoded" };
Pattern REGEX = Pattern.compile("(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$");
for (String url : URLs) {
    Matcher matcher = REGEX.matcher(url);
    if (matcher.find()) {
        System.out.println(matcher.group());
    }
}

Output:

http://foo.com/blah_blah
http://foo.com/blah_blah/
http://foo.com/blah_blah_(wikipedia)

UPDATE

To match URLs in larger texts, you need to replace ^ and $ with \\b:

Pattern REGEX = Pattern.compile("(?i)\\b(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?\\b");

See another demo
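
For a line that contains several URLs, calling `find()` in a loop returns each match in turn. The sketch below uses a simplified stand-in pattern (an assumption, to keep the example short); the `\\b`-anchored pattern above can be substituted for it. The class name, method name, and sample log line are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllUrls {
    // Simplified stand-in for the long pattern (assumption for brevity).
    static final Pattern URL = Pattern.compile("(?i)\\b(?:https?|ftp)://\\S+");

    // Collects every match in the line, left to right.
    static List<String> findAll(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(line);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "[12:00] <danny> see http://foo.com/a and ftp://bar.org/b now";
        System.out.println(findAll(line)); // [http://foo.com/a, ftp://bar.org/b]

        // split() treats the URLs as separators, so it returns the text
        // between the matches rather than the matches themselves.
        System.out.println(Arrays.toString(URL.split(line)));
    }
}
```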

Wiktor Stribiżew
  • That means you need to adapt this regex to match URLs inside larger strings. You need to replace `^` and `$` with `\\b`, a word boundary. – Wiktor Stribiżew Jul 15 '15 at 21:52
  • It is IDEONE that replaces real URLs with placeholders. I'll give a word boundary a try. – Danny Lo Jul 15 '15 at 21:55
  • I have the next requirement for you :) – Danny Lo Jul 15 '15 at 22:05
  • Do not use `split`, it just does not work in this case. – Wiktor Stribiżew Jul 15 '15 at 22:09
  • Ahh. But I could split a string using "\\s" and then evaluate the resulting strings with the monster regex, right? – Danny Lo Jul 15 '15 at 22:44
  • Yes, you can split and check, but I do not see the point since you have the regex to match URL substrings in a larger string. Just use `while (matcher.find()) {...}`. – Wiktor Stribiżew Jul 15 '15 at 23:02
  • I didn't know `find()` can be used in a loop to continuously search in a string. Thanks again for the help! – Danny Lo Jul 15 '15 at 23:33
  • This regex is mostly good, but produces false positives, e.g. `http://a.b--c.de/` and `http://www.foo.bar./` – SpaceBison Sep 18 '19 at 09:12
  • @SpaceBison If you can list additional requirements, the pattern can be improved. E.g. You may add `(?![^/]*--)` after `(?:(?:https?|ftp)://)`, see the [regex demo](https://regex101.com/r/PY0uhU/1) (converted to PCRE pattern). – Wiktor Stribiżew Sep 18 '19 at 09:19
  • @WiktorStribiżew I ran tests against the list found [here](https://mathiasbynens.be/demo/url-regex) and these were the cases in which the regex failed. That being said, this one is so far the best one I found. :) – SpaceBison Sep 18 '19 at 10:41
  • @SpaceBison I see. If you add `(?![^/]*--)(?![^/]*\./)` after `(?:(?:https?|ftp)://)` you will get the best expression so far for that validation list. See [demo](https://regex101.com/r/PY0uhU/2). Note that I do not think it is possible for a regex to handle (here, reject) URLs that contain spaces when extracting them from a larger text. Regex does not "know" if the next word is part of the URL or normal text. – Wiktor Stribiżew Sep 18 '19 at 10:48
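
The two negative lookaheads suggested in the comments can be tried in isolation. This is a rough sketch on a shortened pattern (an assumption; only the scheme and the lookaheads are taken from the thread, the rest is a placeholder), with hypothetical class and method names:

```java
import java.util.regex.Pattern;

class HostLookaheads {
    // Simplified stand-in: scheme, the two lookaheads from the comments,
    // then any run of non-whitespace characters.
    static final Pattern URL = Pattern.compile(
        "(?i)^(?:https?|ftp)://"
        + "(?![^/]*--)"     // reject "--" anywhere before the first "/"
        + "(?![^/]*\\./)"   // reject a "." directly before the first "/"
        + "\\S+$");

    static boolean isValid(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://a.b--c.de/"));        // false
        System.out.println(isValid("http://www.foo.bar./"));     // false
        System.out.println(isValid("http://foo.com/blah_blah")); // true
    }
}
```

Because `[^/]*` cannot cross a slash, both checks only look at the part before the first `/` after the scheme, so a `--` inside the path is not rejected.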