2

Greetings all. I am using the following regex to detect urls in a string and wrap them inside the < a > tag

public static String detectUrls(String text) {

        String newText = text
                .replaceAll("(?:https?|ftps?|http?)://[\\w/%.-?&=]+",
                        "<a href='$0'>$0</a>").replaceAll(
                        "(www\\.)[\\w/%.-?&=]+", "<a href='http://$0'>$0</a>");
        return newText;
    }

i have a problem that the following links are not detected correctly: i am not that good with regex, so please advise.

http://code.google.com/p/shindig-dnd/

http://confluence.atlassian.com/display/GADGETDEV/Gadgets+and+JIRA+Portlets

www.liferay.com/web/raymond.auge/blog/

(www.opensocial.org/)

http://www.google.com

Mahmoud Saleh
  • 33,303
  • 119
  • 337
  • 498
  • 1
    Checkout http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url – ismail Dec 23 '10 at 13:31

3 Answers3

3

I'm using this:

private static final String URL_REGEX = 
   "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";

Matcher matcher = URL_PATTERN.matcher(text);
text = matcher.replaceAll("<a href=\"$0\">$0</a>");
return text;
Bozho
  • 588,226
  • 146
  • 1,060
  • 1,140
  • 1
    Declare `&` instead of `&` would suffice because `a`, `m` and `p` are already in the range `a-z` and `;` is delared twice. – Toto Dec 23 '10 at 14:06
  • this pattern works fine for most cases, but didn't catch this case: (www.opensocial.org) – Mahmoud Saleh Dec 26 '10 at 09:23
2

The problem you have is that you are using - within a character group ([]) without escaping it, which is being used to define the range .-? (i.e. the characters ./0123456789:;<=>?). Either escape it \\- or put it at the end of the character class so that it doesn't complete a range.

public static String detectUrls(String text) {
    String newText = text
            .replaceAll("(?:https?|ftps?|http?)://[\\w/%.\\-?&=]+",
                    "<a href='$0'>$0</a>").replaceAll(
                    "(www\\.)[\\w/%.\\-?&=]+", "<a href='http://$0'>$0</a>");
    return newText;
}
moinudin
  • 134,091
  • 45
  • 190
  • 216
  • @marcog: there's actually one pattern that's still not catched: something like http: //www.google.com – Mahmoud Saleh Dec 27 '10 at 12:30
  • @sword Is that space after `http:` a typo? – moinudin Dec 27 '10 at 12:40
  • @marcog,yes i meant to add it coz without the space the editor will convert it to http://www.google.com so i add this to skip the editor formatting, and you know what i want to say right ? – Mahmoud Saleh Dec 27 '10 at 16:07
  • @marcog, what do you suggest ? – Mahmoud Saleh Dec 27 '10 at 16:33
  • @sword Swap the `replaceAll()` calls around and use [negative lookbehind](http://www.regular-expressions.info/lookaround.html). Here's it working: [http://ideone.com/Dj6ew](http://ideone.com/Dj6ew) with one minor issue - it also adds `http://` in front of the displayed URL. This is a limitation of regular expressions, and to fix it you'll have to parse the text in one pass without regular expressions. – moinudin Dec 27 '10 at 16:39
  • @marcog, thanks a lot for your feedback, still it didn't catch the following url correctly http://www.listenarabic.com/ar/Music+Egypt/ any ideas ? – Mahmoud Saleh Jan 03 '11 at 12:51
  • @sword That's because your character set doesn't include a `+`. Add it int the []'s. – moinudin Jan 03 '11 at 12:52
  • @marcog, you are righ, i added it still this pattern is not catched https://www.google.com – Mahmoud Saleh Jan 03 '11 at 13:24
  • @sword I strongly advise you try understand what your regular expression is doing. Read up on regular expressions if you don't understand them (I recommend http://www.regular-expressions.info/). The problems you're facing are easy to solve once you understand a little about regular expressions. I'm going to leave it to you to find the remaining problems. – moinudin Jan 03 '11 at 13:27
  • @marcog, i tried the following, and it works, i still need a little help for just www.google.com .....replaceAll( "(?<!(?:https?|ftps?|http?)://)www\\.[\\w/%.\\-?&=+#]+", "$0") – Mahmoud Saleh Jan 03 '11 at 16:15
1

As marcog said, you should escape the - and to match the last 2 examples you gave, you have to make the http optionnal. Also http? matches htt wich is not a correct protocol.

So the regex will be:

"(?:(?:https?|ftps?)://)?[\\w/%.?&=-]+"
Toto
  • 89,455
  • 62
  • 89
  • 125