issues with my regex to detect urls in a string?

Question

Greetings all. I am using the following regex to detect urls in a string and wrap them inside the < a > tag

public static String detectUrls(String text) {

        String newText = text
                .replaceAll("(?:https?|ftps?|http?)://[\\w/%.-?&=]+",
                        "<a href='$0'>$0</a>").replaceAll(
                        "(www\\.)[\\w/%.-?&=]+", "<a href='http://$0'>$0</a>");
        return newText;
    }

i have a problem that the following links are not detected correctly: i am not that good with regex, so please advise.

http://code.google.com/p/shindig-dnd/

http://confluence.atlassian.com/display/GADGETDEV/Gadgets+and+JIRA+Portlets

www.liferay.com/web/raymond.auge/blog/

(www.opensocial.org/)

http://www.google.com

Checkout http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url — ismail, Dec 23 '10 at 13:31

score 3 · Answer 1 · answered Dec 23 '10 at 13:31

3

I'm using this:

private static final String URL_REGEX = 
   "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";

Matcher matcher = URL_PATTERN.matcher(text);
text = matcher.replaceAll("<a href=\"$0\">$0</a>");
return text;

answered Dec 23 '10 at 13:31

Bozho

588,226
146
1,060
1,140

1

Declare `&` instead of `&` would suffice because `a`, `m` and `p` are already in the range `a-z` and `;` is delared twice. – Toto Dec 23 '10 at 14:06
this pattern works fine for most cases, but didn't catch this case: (www.opensocial.org) – Mahmoud Saleh Dec 26 '10 at 09:23

score 2 · Accepted Answer · answered Dec 23 '10 at 13:36

2

The problem you have is that you are using - within a character group ([]) without escaping it, which is being used to define the range .-? (i.e. the characters ./0123456789:;<=>?). Either escape it \\- or put it at the end of the character class so that it doesn't complete a range.

public static String detectUrls(String text) {
    String newText = text
            .replaceAll("(?:https?|ftps?|http?)://[\\w/%.\\-?&=]+",
                    "<a href='$0'>$0</a>").replaceAll(
                    "(www\\.)[\\w/%.\\-?&=]+", "<a href='http://$0'>$0</a>");
    return newText;
}

answered Dec 23 '10 at 13:36

moinudin

134,091
45
190
216

@marcog: there's actually one pattern that's still not catched: something like http: //www.google.com – Mahmoud Saleh Dec 27 '10 at 12:30
@sword Is that space after `http:` a typo? – moinudin Dec 27 '10 at 12:40
@marcog,yes i meant to add it coz without the space the editor will convert it to http://www.google.com so i add this to skip the editor formatting, and you know what i want to say right ? – Mahmoud Saleh Dec 27 '10 at 16:07
@marcog, what do you suggest ? – Mahmoud Saleh Dec 27 '10 at 16:33
@sword Swap the `replaceAll()` calls around and use [negative lookbehind](http://www.regular-expressions.info/lookaround.html). Here's it working: [http://ideone.com/Dj6ew](http://ideone.com/Dj6ew) with one minor issue - it also adds `http://` in front of the displayed URL. This is a limitation of regular expressions, and to fix it you'll have to parse the text in one pass without regular expressions. – moinudin Dec 27 '10 at 16:39
@marcog, thanks a lot for your feedback, still it didn't catch the following url correctly http://www.listenarabic.com/ar/Music+Egypt/ any ideas ? – Mahmoud Saleh Jan 03 '11 at 12:51
@sword That's because your character set doesn't include a `+`. Add it int the []'s. – moinudin Jan 03 '11 at 12:52
@marcog, you are righ, i added it still this pattern is not catched https://www.google.com – Mahmoud Saleh Jan 03 '11 at 13:24
@sword I strongly advise you try understand what your regular expression is doing. Read up on regular expressions if you don't understand them (I recommend http://www.regular-expressions.info/). The problems you're facing are easy to solve once you understand a little about regular expressions. I'm going to leave it to you to find the remaining problems. – moinudin Jan 03 '11 at 13:27
@marcog, i tried the following, and it works, i still need a little help for just www.google.com .....replaceAll( "(?<!(?:https?|ftps?|http?)://)www\\.[\\w/%.\\-?&=+#]+", "$0") – Mahmoud Saleh Jan 03 '11 at 16:15

score 1 · Answer 3 · answered Dec 23 '10 at 14:03

1

As marcog said, you should escape the - and to match the last 2 examples you gave, you have to make the http optionnal. Also http? matches htt wich is not a correct protocol.

So the regex will be:

"(?:(?:https?|ftps?)://)?[\\w/%.?&=-]+"

answered Dec 23 '10 at 14:03

Toto

89,455
62
89
125

issues with my regex to detect urls in a string?

3 Answers3

Linked