Java .split() with regex to match html links

Question

I need to parse a string and escape all html tags except <a> links.

For example:

"Hello, this is <b>A BOLD</b> bit and this is <a href="www.google.com">a google</a> link"

When printed out in my jsp, I want to see the tags printed out as is (i.e. escaped so "A BOLD" is not actually in bold on the page) but the <a> link to be an actual link to google on the page.

I have a little method that splits the incoming string based on a regex to match <a> links in various formats (with white space, single or double quotes, etc.). The regex is as follows:

myString.split("<a\\s[^>]*href\\s*=\\s*[\\\"\\|\\\'][^>]*[\\\"\\|\\\']\\s*>[^<\\/a>]*<\\/a>");

Yes, it's horrid and probably hopelessly inefficient, so I'm open to alternative suggestions, but it does work up to a point. Where it falls down is in parsing the link text. I want it to accept zero or more occurrences of any characters other than the </a> closing tag, but it is parsing it as zero or more occurrences of any character other than "<", "/", "a" or ">", i.e. as individual characters rather than the complete </a> word. So it matches with any text that has an "e" in it for example.

The bit in question is: [^<\\/a>]*

How do I change this to match on the entire word, not its constituent characters? I've tried parentheses etc. but nothing works.
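
To illustrate the behaviour described above: inside square brackets every character is an independent alternative, so [^</a>] excludes the four single characters rather than the sequence </a>. Two common ways to treat </a> as a unit are a reluctant quantifier (.*?) and a negative lookahead ((?!</a>).). The following is only a minimal sketch: the class and constant names are hypothetical, and the opening-tag part is simplified to <a\s[^>]*> rather than the full href pattern from the question.

```java
import java.util.regex.Pattern;

class LinkTextDemo {
    // Character class: the link text may not contain '<', '/', 'a' or '>' as single characters.
    static final Pattern CHAR_CLASS =
            Pattern.compile("<a\\s[^>]*>[^</a>]*</a>");

    // Reluctant quantifier: any characters, as few as possible, up to the first "</a>".
    static final Pattern RELUCTANT =
            Pattern.compile("<a\\s[^>]*>.*?</a>", Pattern.DOTALL);

    // Negative lookahead: consume a character only if "</a>" does not start at that position.
    static final Pattern LOOKAHEAD =
            Pattern.compile("<a\\s[^>]*>(?:(?!</a>).)*</a>", Pattern.DOTALL);

    // True if the pattern is found anywhere in the input.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String link = "<a href=\"www.google.com\">a google</a>";
        System.out.println(matches(CHAR_CLASS, link)); // false: the link text contains an 'a'
        System.out.println(matches(RELUCTANT, link));  // true
        System.out.println(matches(LOOKAHEAD, link));  // true
    }
}
```

Note that String.split() discards whatever the regex matches, so the links themselves would still have to be recovered separately, for example with a Matcher loop.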

Yours is a bad way to do it. Use a real parser. If it's XHTML, you can parse it as XML. — duffymo, Oct 13 '11 at 11:56
[Don't use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You can try using [jsoup](http://jsoup.org/) instead. — skyuzo, Oct 13 '11 at 11:57
Just for my information: I assume the string to escape comes from a user input. Are xml parsers or JSoup robust enough for user syntax errors? E.g. won't they 'die' if user inputs something like: "
xml>"? — Laurent', Oct 13 '11 at 12:56

score 2 · Answer 1 · answered Oct 13 '11 at 12:23


You can clean your HTML without ruining <a> tags by using the jsoup HTML Cleaner with a Whitelist:

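import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String unsafe =
    "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
// addTags is an instance method, so start from an empty Whitelist and allow only <a href>
String safe = Jsoup.clean(unsafe, new Whitelist().addTags("a").addAttributes("a", "href"));
// the <p> tags and the onclick attribute are removed (not escaped); the <a href> link is kept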

answered Oct 13 '11 at 12:23

skyuzo


Thank you for the pointer. I had to modify it a bit to get it working: `String dirty = "Hello, testing the val a links to make sure they work.";` `String cleaned = Jsoup.clean(dirty, new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));` Unfortunately this strips out the unwanted html completely. My requirement is to escape the unwanted html. I see there are OutputSettings on the Document object but that's not being used here. Am I missing something? – DM_Blunders Oct 13 '11 at 13:47
Try [AntiSamy](https://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project) instead. – skyuzo Oct 13 '11 at 20:19

score 0 · Answer 2 · answered Oct 13 '11 at 12:31

Although I agree with the consensus that regexes were not designed to parse x*ml, I feel that sometimes you just don't have the time to learn, practice and implement new concepts, and that a simple regex might well suffice in your case.

If you have enough time, learn XML parsers. Otherwise, here is an untested and maybe not user-proof regex proposal for your problem (escape the backslashes for Java strings):

<\s*(?:[^aA]\b|[a-zA-Z0-9]{2,})[^>]*>

Which translates into:

<\s* # less-than character with optional space
(?:  # non capturing group of
  [^aA]\b         # a single character other than a or A, followed by a word boundary
  |              # or
  [a-zA-Z0-9]{2,} # at least two alphanumeric characters
)
[^>]*> # ... anything until the first greater-than character
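
As a rough illustration of how such a pattern could be applied from Java, here is a minimal sketch that escapes every tag except <a> links. It does not use the pattern above verbatim: as written, [^aA]\b also accepts the slash of a closing </a>, so that tag would be escaped too. The sketch swaps in a negative lookahead instead, which is my own substitution, and the class and method names are hypothetical. It only rewrites the angle brackets of matched tags; it does not escape & or stray < characters, and it leaves attributes such as onclick on the links untouched, so it should not be mistaken for a sanitizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeNonLinkTags {
    // Any tag, opening or closing, whose name is not exactly "a" or "A".
    // The lookahead rejects "<a", "</a", "<A" and "</A" when the letter ends the tag name.
    static final Pattern NON_LINK_TAG =
            Pattern.compile("<(?!\\s*/?\\s*[aA]\\b)[^>]*>");

    static String escapeNonLinkTags(String input) {
        Matcher m = NON_LINK_TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Escape only the angle brackets of the matched tag.
            String escaped = m.group().replace("<", "&lt;").replace(">", "&gt;");
            // quoteReplacement guards against '$' and '\' in the tag text.
            m.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "Hello, this is <b>A BOLD</b> bit and this is "
                + "<a href=\"www.google.com\">a google</a> link";
        // prints: Hello, this is &lt;b&gt;A BOLD&lt;/b&gt; bit and this is <a href="www.google.com">a google</a> link
        System.out.println(escapeNonLinkTags(in));
    }
}
```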

jsoup is really simple for what he wants though. – skyuzo, Oct 13 '11 at 12:32
