1

Hi, I'm trying to find a URL in a string. I found many topics about doing this with regex, but I have a problem. Using this pattern:

String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

It works pretty well on most pages, but I have an issue with others. For example:

http://hello.com/hello world

returns

http://hello.com/hello

The problem is that space.

Does anyone have a nice pattern that solves this?

Thanks.

EDIT: this is my code:

private ArrayList<String> pullLinks(String text) {
    ArrayList<String> links = new ArrayList<String>();

    String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(text);
    while (m.find()) {
        String urlStr = m.group();
        if (urlStr.startsWith("(") && urlStr.endsWith(")")) {
            urlStr = urlStr.substring(1, urlStr.length() - 1);
        }
        links.add(urlStr);
    }
    return links;
}
Alexx Perez
  • Offtopic: There are more TLDs with more than 2 letters than those you have listed. Check the [Wikipedia list of TLDs](http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains). Also, your regexp will miss URLs written like this: `example.com`. – ShaMan-H_Fel Mar 16 '12 at 13:10
  • Offtopic, but here's a good pattern for matching URLs, explained row by row: http://daringfireball.net/2010/07/improved_regex_for_matching_urls – Holm Mar 16 '12 at 13:18

2 Answers

4

Spaces are not allowed in URLs (they need to be replaced by %20). See for instance the answer to this question:

If you allow URLs to include spaces anyway, then how would you interpret for instance http://www.google.com/ig is a nice webpage? Clearly the part after /ig should not be included!
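To illustrate the point about `%20`, here is a minimal sketch using `java.net.URI` from the standard library. The class and helper names (`SpaceInUrl`, `isValidUri`, `encodeSpaces`) are made up for this example, and `encodeSpaces` only handles literal spaces, so treat it as a demonstration rather than a general-purpose encoder.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SpaceInUrl {
    // Returns true if java.net.URI accepts the string unchanged.
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Naive encoding: replaces only literal spaces with %20.
    static String encodeSpaces(String s) {
        return s.replace(" ", "%20");
    }

    public static void main(String[] args) {
        String raw = "http://hello.com/hello world";
        System.out.println(isValidUri(raw));               // false
        System.out.println(encodeSpaces(raw));             // http://hello.com/hello%20world
        System.out.println(isValidUri(encodeSpaces(raw))); // true
    }
}
```

The raw string is rejected by the parser because of the space, while the encoded form is accepted.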

aioobe
0

Space is not a valid URL character.

Also, if you don't use whitespace as your terminator how are you going to find the end of the URL?

Your regex is also failing to account for other top level domains (like .int). I'm not actually sure why it is looking for specific TLDs at all as they are not required to form a valid URL.
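As a rough sketch of both points (whitespace as the terminator, no TLD list), something like the following could work. The class name `SimpleLinkFinder` and the choice of schemes are assumptions for the example, and the pattern is deliberately loose: it will also pick up trailing punctuation such as a final `.` or `)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleLinkFinder {
    // A scheme, "://", then everything up to the next whitespace character.
    private static final Pattern URL =
            Pattern.compile("\\b(?:https?|ftp)://\\S+");

    static List<String> pullLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "see http://hello.com/hello world and ftp://files.example.int/a.txt";
        // The first match stops at the space, the second has a TLD not in the original list.
        System.out.println(pullLinks(text));
    }
}
```

With this pattern `http://hello.com/hello world` still yields `http://hello.com/hello`, which is the expected result when whitespace ends the URL.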

Dev
  • It's not a problem for me that it fails with .int or others. My URLs will always look like: http://something.es/some some.jpg – Alexx Perez Mar 16 '12 at 13:39
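
If every link really does end in a known extension such as `.jpg`, one possible workaround is to use that extension as the terminator instead of whitespace. The sketch below assumes exactly that; the class name `JpgLinkFinder` and the restriction to `http`/`https` and `.jpg` are choices made for the example. A link that does not end in `.jpg` would cause the match to run on until the next `.jpg` on the same line, so this only fits input you control.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JpgLinkFinder {
    // Lazily match from the scheme up to the first ".jpg" that ends a word.
    // "." does not match line breaks by default, so a match stays on one line.
    private static final Pattern JPG_URL =
            Pattern.compile("\\bhttps?://.+?\\.jpg\\b");

    static List<String> pullJpgLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = JPG_URL.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "img: http://something.es/some some.jpg end";
        for (String link : pullJpgLinks(text)) {
            // Encode the spaces afterwards so the result is a usable URL.
            System.out.println(link.replace(" ", "%20"));
        }
    }
}
```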