Find URL in String

Question

Hi, I'm trying to find a URL in a string. I found many topics about this using regex, but I have a problem. Using this pattern:

String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

It works pretty well on most pages, but I have an issue with others. For example:

http://hello.com/hello world

returns

http://hello.com/hello

The problem is that space.

Does anyone have a nice pattern that solves this?

Thanks.

EDIT: this is my code:

private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<String>();

    String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}

Off-topic: there are more TLDs with more than 2 letters than the ones you have listed. Check [Wikipedia list of TLDs](http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains). Also, your regexp will miss URLs written like this: `example.com`. – ShaMan-H_Fel, Mar 16 '12 at 13:10
Off-topic, but here's a good pattern for matching URLs, explained row by row: http://daringfireball.net/2010/07/improved_regex_for_matching_urls – Holm, Mar 16 '12 at 13:18

score 4 · Accepted Answer · edited May 23 '17 at 12:11

Spaces are not allowed in URLs (they need to be replaced by %20). See for instance the answer to this question:

Spaces in URLs?

If you allow URLs to include spaces anyway, then how would you interpret for instance http://www.google.com/ig is a nice webpage? Clearly the part after /ig should not be included!
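As a quick check of the point about spaces, here is a minimal sketch that uses `java.net.URI` to show that the raw string with a space is rejected while the `%20` form parses. The class and method names (`SpaceInUrlCheck`, `isValidUri`) are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SpaceInUrlCheck {
    // Returns true if java.net.URI accepts the text as a syntactically valid URI.
    static boolean isValidUri(String text) {
        try {
            new URI(text);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        // A literal space is an illegal character, so parsing fails.
        System.out.println(isValidUri("http://hello.com/hello world"));   // false
        // The percent-encoded form is accepted.
        System.out.println(isValidUri("http://hello.com/hello%20world")); // true
        // getPath() decodes the escape; getRawPath() would keep "%20".
        System.out.println(new URI("http://hello.com/hello%20world").getPath()); // /hello world
    }
}
```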

answered Mar 16 '12 at 13:01

aioobe

So there isn't any way to detect URLs with %20? – Alexx Perez Mar 16 '12 at 13:38
Sure there is. The expression you have already does. Look for instance at `%[a-f\\d]{2}` (means `%` followed by `{2}` characters that are in range `a-f` or are digits). – aioobe Mar 16 '12 at 13:42
This is not working for me. I've edited the question to add my code. Thanks. – Alexx Perez Mar 16 '12 at 13:55
(http://hello.com/hello world) returns http://hello.com/hello, but it should return http://hello.com/hello%20world – Alexx Perez Mar 16 '12 at 14:05
@AlexxPerez, what should ("google.com/ig is a nice website") return? – aioobe Mar 16 '12 at 14:09
Sorry, I know what you mean, but my URLs always end with .jpg: ("http://google.com/ig is a nice website.jpg") should return http://google.com/ig%20is%20a%20nice%20website.jpg – Alexx Perez Mar 16 '12 at 14:13
Then add the following to your regular expression: `regex = "(" + regex + ").*?\\.jpg"`. – aioobe Mar 16 '12 at 14:15
Just one more thing, please: how can I select only up to the first .jpg? – Alexx Perez Apr 03 '12 at 11:55
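The suggestion from the comments can be sketched as follows. The reluctant quantifier `.*?` already stops at the first `.jpg` after the start of the URL, which is what the last comment asks for. To keep the example short, `BASE` is a simplified stand-in for the question's regex (an assumption, it only matches scheme and host), and the `%20` replacement and the names `JpgLinkFinder` and `pullJpgLinks` are illustrative additions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JpgLinkFinder {
    // Simplified stand-in for the question's URL regex (assumption): scheme plus host only.
    private static final String BASE = "https?://[-\\w]+(\\.[-\\w]+)+";

    // The reluctant .*? extends each match only as far as the first ".jpg" that follows.
    private static final Pattern JPG_URL = Pattern.compile("(" + BASE + ").*?\\.jpg");

    static List<String> pullJpgLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = JPG_URL.matcher(text);
        while (m.find()) {
            // Encode the spaces that were allowed inside the match.
            links.add(m.group().replace(" ", "%20"));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "see http://google.com/ig is a nice website.jpg and other.jpg";
        // prints [http://google.com/ig%20is%20a%20nice%20website.jpg]
        System.out.println(pullJpgLinks(text));
    }
}
```

Note that `.` does not match line terminators by default, so a match never runs past the end of a line, and a URL that is not followed by `.jpg` on the same line is not matched at all.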

score 0 · Answer 2 · answered Mar 16 '12 at 13:03

Space is not a valid URL character.

Also, if you don't use whitespace as your terminator how are you going to find the end of the URL?

Your regex is also failing to account for other top level domains (like .int). I'm not actually sure why it is looking for specific TLDs at all as they are not required to form a valid URL.
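To illustrate the points about whitespace as a terminator and about not listing TLDs, here is a rough sketch of a much simpler pattern that takes everything from the scheme up to the next whitespace character. The pattern `https?://\S+` and the names `WhitespaceDelimitedUrls` and `find` are assumptions chosen for this example, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitespaceDelimitedUrls {
    // Scheme followed by any run of non-whitespace characters; no TLD list involved.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static List<String> find(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // prints [http://www.google.com/ig]
        System.out.println(find("http://www.google.com/ig is a nice webpage"));
    }
}
```

One trade-off of this approach is that trailing punctuation directly after a URL, such as a closing parenthesis or a full stop, ends up inside the match and would need to be trimmed separately.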

Dev

It's not a problem for me that it fails with .int or others. My URLs will always look like: http://something.es/some some.jpg – Alexx Perez Mar 16 '12 at 13:39
