7

I have a WYSIWYG text area in a Java webapp. Users can type and style text, or paste text that is already HTML-formatted.

What I am trying to do is linkify the text, i.e. convert every URL in the text into a working link by wrapping it in <a href="...">...</a>.

This solution works when all I have is plain text:

String r = "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
Pattern pattern = Pattern.compile(r, Pattern.DOTALL | Pattern.UNIX_LINES | Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(comment);
comment = matcher.replaceAll("<a href=\"$0\">$0</a>"); // group 0 is the whole expression

The problem arises when some of the text is already formatted, i.e. it already contains <a href="...">...</a> tags.

So I am looking for a way to stop the pattern from matching a URL that sits inside an existing <a> element. I have read that this can be achieved with lookahead or lookbehind, but I still can't make it work. I am sure I am doing it wrong, because the regex still matches. And yes, I have been playing around with and debugging groups, changing $0 to $1, etc.

Any ideas?

Fabian Steeg
  • I wonder how many more questions about this topic are needed so that every permutation of the title already exists on SO and people start to use one of the solutions that has been worked out previously. – Tomalak Mar 10 '09 at 12:11
  • i spent a great deal of time with this one and did some research, but still couldn't figure out. stack overflow has helped me find the solution and now the whole community can take advantage of these answers. your comment is inaccurate and offending. –  Mar 10 '09 at 13:08
  • i also challenge you to show me one solution to this problem that was already on SO with a "permuted title" –  Mar 10 '09 at 13:12
  • @frank06: My comment is far from inaccurate. I spend much time here and I've seen this very question at least ten times already. The whole community obviously does not take advantage of it, since this is being asked continuously nevertheless, as it seems. – Tomalak Mar 10 '09 at 13:28
  • See for yourself: http://www.google.com/search?q=urls+links+regex+html+site%3Astackoverflow.com – Tomalak Mar 10 '09 at 13:29
  • @frank06: Just to make that clear - my comment was not against you personally. Your question is well-asked, you did your share of preparation/thinking etc. It will add to the other questions about the topic, and at some point, I hope, people actually *find* and *use* one of the existing solutions. – Tomalak Mar 10 '09 at 13:37
  • ok, i agree there was this: http://stackoverflow.com/questions/287144/need-a-good-regex-to-convert-urls-to-links-but-leave-existing-links-alone ... somehow google was not my friend on this one. hope no one else makes the mistake of posting the same again! –  Mar 10 '09 at 14:58
  • @frank06: Here we go again: http://stackoverflow.com/questions/635844. You see, it *is* repetitive. ;-) That's what I meant by "permutations": people use widely different titles for more or less the same problem. Unfortunately the "related questions" feature suggests similarly titled questions only. – Tomalak Mar 11 '09 at 19:04

5 Answers

9

You are close. You can use a "negative lookbehind" like so:

(?<!href=")http:// etc

All matches immediately preceded by href=" will be skipped.
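
A minimal sketch of this lookbehind in Java follows. The class name LookbehindDemo and the shortened URL pattern are assumptions made for brevity; they are not the pattern from the question. The third call suggests a limit of the approach: a URL used as the visible link text is not preceded by href=", so it still gets wrapped a second time.

```java
import java.util.regex.Pattern;

class LookbehindDemo {
    // Simplified URL pattern (an assumption for illustration, not the question's regex).
    // The negative lookbehind rejects any URL that directly follows href=".
    private static final Pattern URL =
            Pattern.compile("(?<!href=\")https?://[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);

    static String linkify(String text) {
        // $0 is the whole match, as in the question's code.
        return URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // Plain text: the URL is wrapped in a link.
        System.out.println(linkify("see http://example.com/a now"));
        // URL only inside the href attribute: left untouched.
        System.out.println(linkify("<a href=\"http://example.com/\">home</a>"));
        // URL also used as link text: the text copy is still matched and wrapped again.
        System.out.println(linkify("<a href=\"http://example.com/\">http://example.com/</a>"));
    }
}
```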

Kees de Kooter
1

If you want to use a regex (though I think parsing the input as XML/HTML first is more robust), a lookahead or lookbehind makes sense. A first stab might be to add this at the end of your regex:

(?!</a>)

Meaning: don't match if there's a closing </a> tag just afterwards. (This could be tweaked forever, of course.) This doesn't work well, though, because given the string

<a href="...">http://example.com/</a>

This regex will try to match "http://example.com/", fail due to the lookahead (as we hope), and then backtrack the greedy quantifier it has on the end and match "http://example.com" instead, which doesn't have a </a> after it.

You can fix the latter problem by using possessive quantifiers on your +, * and ? operators: just stick a + after them. This prevents them from backtracking. It is probably good for performance, as well.

This works for me (note the three extra +'s):

String r = "http(s)?://([\\w+?\\.\\w+])++([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*+)?+(?!</a>)";
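
To make the backtracking difference visible, here is a small sketch with two shortened patterns. The class PossessiveDemo, the helper firstMatch and both patterns are made up for illustration; they only differ in whether the final quantifier is greedy (+) or possessive (++).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDemo {
    // Simplified URL patterns, assumed for illustration only.
    static final Pattern GREEDY = Pattern.compile("https?://[\\w./]+(?!</a>)");
    static final Pattern POSSESSIVE = Pattern.compile("https?://[\\w./]++(?!</a>)");

    /** Returns the first match of the pattern in the input, or null if there is none. */
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String linked = "<a href=\"...\">http://example.com/</a>";
        // Greedy: gives back the trailing "/" so the lookahead passes on a shorter match.
        System.out.println(firstMatch(GREEDY, linked));      // http://example.com
        // Possessive: cannot give characters back, so the lookahead rejects the URL.
        System.out.println(firstMatch(POSSESSIVE, linked));  // null
        // A URL in plain text is still found.
        System.out.println(firstMatch(POSSESSIVE, "go to http://example.com/ now"));
    }
}
```

Note that the lookahead only inspects what follows the URL, so in this sketch a URL inside an href attribute (followed by a quote rather than </a>) would still match.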
Jesse Rusak
1

If you really want to do it with regex, then:

   String r = "(?<![=\"\\/>])http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";

i.e. check that the URL does not immediately follow one of the characters =, ", / or >.
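
The following sketch tries the same single-character lookbehind on a few inputs. The class name and the shortened URL pattern are assumptions for illustration. Because the lookbehind only looks at one preceding character, it appears to skip any URL that directly follows a tag, and it does not notice an enclosing <a> when other text comes first.

```java
import java.util.regex.Pattern;

class CharClassLookbehindDemo {
    // Simplified URL pattern (assumed); the lookbehind is the one from the answer above.
    private static final Pattern URL =
            Pattern.compile("(?<![=\"/>])https?://[^\\s<>\"]+");

    static String linkify(String text) {
        return URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>");
    }

    public static void main(String[] args) {
        // Both copies are skipped: one follows a quote, the other follows ">".
        System.out.println(linkify("<a href=\"http://example.com/\">http://example.com/</a>"));
        // Also skipped, although it is not a link yet: the URL directly follows "<p>".
        System.out.println(linkify("<p>http://example.com/</p>"));
        // Linkified, although it is already inside a link: the URL follows a space.
        System.out.println(linkify("<a href=\"...\">see http://example.com/</a>"));
    }
}
```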

siddhadev
0

Perhaps html parsing will be more appropriate for you (htmlparser for example). Then you could have html nodes and only "linkify" links in the text and not in the attributes.
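
As a rough illustration of the idea of touching text nodes only, here is a sketch that uses nothing but java.util.regex. It is not a real HTML parser and does not use the htmlparser library: it splits the input into tags and text runs with a regex, tracks whether it is inside an <a> element, and linkifies only text outside of one. The class name, the token pattern and the shortened URL pattern are all assumptions, and input such as attribute values containing ">" would confuse it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAwareLinkifier {
    // A token is a tag (group 1), a run of text (group 2), or a stray "<" (group 3).
    private static final Pattern TOKEN = Pattern.compile("(<[^>]*>)|([^<]+)|(<)");
    private static final Pattern OPEN_A =
            Pattern.compile("<a(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern CLOSE_A =
            Pattern.compile("</a\\s*>", Pattern.CASE_INSENSITIVE);
    // Simplified URL pattern (an assumption, not the question's regex).
    private static final Pattern URL =
            Pattern.compile("https?://[^\\s<>\"]+", Pattern.CASE_INSENSITIVE);

    static String linkify(String html) {
        StringBuilder out = new StringBuilder();
        int anchorDepth = 0; // > 0 while inside an <a>...</a> element
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String tag = m.group(1);
            String text = m.group(2);
            if (tag != null) {
                // Tags are copied verbatim, so attribute values are never rewritten.
                if (OPEN_A.matcher(tag).matches()) {
                    anchorDepth++;
                } else if (CLOSE_A.matcher(tag).matches() && anchorDepth > 0) {
                    anchorDepth--;
                }
                out.append(tag);
            } else if (text != null && anchorDepth == 0) {
                // Text outside any link: wrap URLs.
                out.append(URL.matcher(text).replaceAll("<a href=\"$0\">$0</a>"));
            } else {
                // Text inside a link, or a stray "<": copy unchanged.
                out.append(m.group());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("<p>see http://example.com/x</p>"));
        System.out.println(linkify("<a href=\"http://example.com/\">http://example.com/</a>"));
    }
}
```

For real user input, a proper HTML parser that exposes text nodes is likely the safer choice; the sketch only shows the control flow.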

kgiannakakis
0

If you have to roll your own, at least look at the algorithms/patterns used in an Open Source implementation of Markdown, e.g., MarkdownJ.

Hank Gay