How to use Java regex to sanitize HTML that accepts only `<b>` and `<i>` tags?

Question

OK, let's say I have a textarea and the user can input any sort of text into it.

Then I want to put this text into a div element. For example,

document.getElementById('myDiv').innerHTML=text;

The issue is that the user can put HTML code into it, which can distort the div. However, the text is allowed to contain `<b>` or `<i>`.

So I want to replace all `<` with `&lt;` and all `>` with `&gt;`, except in `<b>` or `<i>`.

Note: spaces before and after the i are allowed, so we will keep `< i>`, `<i >`, `< i >`, etc. Also, `<b>` / `</b>` and `<i>` / `</i>` must come in pairs. That means if there is a `<b>` but no `</b>`, the `<b>` should be escaped, and the same goes for `<i>`.

So, how to use Java regex to sanitize HTML that accepts only `<b>` and `<i>` tags?

If you want to enforce pairing, and since pairs can be nested, you cannot use regex, because Java's regex doesn't support nesting. — Andreas, Jan 28 '16 at 02:10
Obligatory reading : http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — user949300, Jan 28 '16 at 02:10
OK, SO deleted my edit of my comment. Please use an HTML parser instead of a REGEX for this. — user949300, Jan 28 '16 at 02:18
@sln Not sure, but it's not allowed in XML, however web browser HTML parsers are notoriously lenient, so they may allow it regardless of specifications. — Andreas, Jan 28 '16 at 02:24
How about using \*italic\*, \*\*bold\*\* instead of `<i>italic</i>`, `<b>bold</b>`? — , Jan 28 '16 at 02:28

score 0 · Answer 1 · answered Jan 28 '16 at 02:22

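You cannot enforce pairing with regex, but if you simply want to neutralize all HTML artifacts except `<b>` and `<i>` and their matching end tags, you need two replaceAll() calls: the first escapes every `&`, the second escapes every `<` that does not start one of the allowed tags.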

input.replaceAll("&", "&amp;").replaceAll("<(?!/?\\s*[bi]\\s*>)", "&lt;");
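As a quick check of what those two calls do, here is a minimal runnable sketch. The class name `TagEscaper`, the method name `sanitize`, and the sample input are only illustrative assumptions. Note that `>` is left unescaped, exactly as in the one-liner above.

```java
public class TagEscaper {
    // Hypothetical helper wrapping the two replaceAll() calls from the answer.
    // Step 1: escape every '&' so existing entities cannot be smuggled in.
    // Step 2: escape every '<' that does NOT begin <b>, <i>, </b> or </i>
    //         (optional whitespace is allowed around the tag name).
    static String sanitize(String input) {
        return input.replaceAll("&", "&amp;")
                    .replaceAll("<(?!/?\\s*[bi]\\s*>)", "&lt;");
    }

    public static void main(String[] args) {
        // Sample input is made up for illustration.
        System.out.println(sanitize("<b>hi</b> <script>x</script> a & b"));
        // prints: <b>hi</b> &lt;script>x&lt;/script> a &amp; b
    }
}
```

Pairing is not checked here: an unclosed `<b>` passes through untouched.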

Andreas

What about the case where "there is a `<b>` but there is no `</b>`"? – Jan 28 '16 at 02:40
@saka1029 You could just generate `
user text here
`. That way, any unended **bold** or *italic* tags started by user text will be ended. Or you could use something other than regex. Or you could do some regex `find()` after the `replaceAll()` calls to check if there are unended **bold** or *italic* tags, then end only when needed. – Andreas Jan 28 '16 at 07:24
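One way the "`find()` after the `replaceAll()` calls" idea might look is sketched below. It is an assumption about how to do it, not code from the answer: a stack tracks open tags, and any `<b>`/`<i>` tag without a properly nested partner is escaped, which is what the question asks for. The class name `PairedTagSanitizer` and its method are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PairedTagSanitizer {
    // Matches the same allowed tags the answer's lookahead lets through.
    private static final Pattern TAG = Pattern.compile("<(/?)\\s*([bi])\\s*>");

    static String sanitize(String input) {
        // After these two calls, every remaining '<' starts an allowed tag.
        String text = input.replaceAll("&", "&amp;")
                           .replaceAll("<(?!/?\\s*[bi]\\s*>)", "&lt;");

        // keep[i] is true when the '<' at index i belongs to a paired tag.
        boolean[] keep = new boolean[text.length()];
        Deque<int[]> open = new ArrayDeque<>(); // entries: {start index, tag letter}
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            char name = m.group(2).charAt(0);
            if (m.group(1).isEmpty()) {
                open.push(new int[] {m.start(), name});        // opening tag
            } else if (!open.isEmpty() && open.peek()[1] == name) {
                keep[open.pop()[0]] = true;                    // matching opener
                keep[m.start()] = true;                        // this closer
            }
            // A closer that does not match the innermost opener stays unpaired.
        }

        // Escape the '<' of every tag that was never paired.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<' && !keep[i]) {
                out.append("&lt;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<b>bold</b> and <i>unclosed"));
        // prints: <b>bold</b> and &lt;i>unclosed
    }
}
```

The regex only locates tags; the pairing logic lives in the stack, which sidesteps the nesting limitation mentioned in the comments. For anything beyond two tags, a dedicated HTML parser or sanitizer library is probably the safer choice.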