
OK, let's say I have a textarea and the user can type any sort of text into it.

Then I want to put this text into a div element. For example:

document.getElementById('myDiv').innerHTML=text;

The issue is that the user can enter HTML code, which can distort the div. However, the text may legitimately contain <b> or <i>.

So I want to replace every < with &lt; and every > with &gt;, except in <b> or <i> tags.

Note: spaces before and after the tag name are allowed, so we keep <i >, < i>, < i >, etc. Also, <b> / </b> and <i> / </i> must come in pairs. That means if there is a <b> but no </b>, the <b> should be escaped, and the same goes for <i>.

So, how can I use Java regex to sanitize HTML so that it accepts only the <b> and <i> tags?

Mr Lister
Tum
  • If you want to enforce pairing, and since pairs can be nested, you cannot use regex, because Java's regex doesn't support nesting. – Andreas Jan 28 '16 at 02:10
  • Obligatory reading: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – user949300 Jan 28 '16 at 02:10
  • OK, SO deleted my edit of my comment. Please use an HTML parser instead of a regex for this. – user949300 Jan 28 '16 at 02:18
  • Is a space before the tag name legal HTML? –  Jan 28 '16 at 02:20
  • @sln Not sure, but it's not allowed in XML, however web browser HTML parsers are notoriously lenient, so they may allow it regardless of specifications. – Andreas Jan 28 '16 at 02:24
  • How about using \*italic\* and \*\*bold\*\* instead of the italic and bold tags? –  Jan 28 '16 at 02:28

1 Answer


You cannot enforce pairing with regex, but if you simply want to escape all HTML artifacts except <b> and <i> and their matching end tags, you need two replaceAll() calls: one that escapes every &, and one that escapes every < that does not start an allowed tag.

String sanitized = input.replaceAll("&", "&amp;")
                        .replaceAll("<(?!/?\\s*[bi]\\s*>)", "&lt;");
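To see what the two calls do on concrete input, here is a minimal, self-contained sketch. The class name `BoldItalicEscaper` and the method name `escape` are made up for illustration; the regex is the one from the answer. Note that it leaves > unescaped and that `[bi]` is case-sensitive, so <B> would be escaped.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex itself comes from the answer above.
final class BoldItalicEscaper {
    // A '<' that is NOT followed by an allowed tag body:
    // optional '/', optional whitespace, 'b' or 'i', optional whitespace, '>'.
    private static final Pattern DISALLOWED_LT =
            Pattern.compile("<(?!/?\\s*[bi]\\s*>)");

    static String escape(String input) {
        // Escape '&' first so that user-typed entities such as "&lt;" stay literal text.
        String ampEscaped = input.replace("&", "&amp;");
        // Then escape every '<' that does not open an allowed <b>/<i> start or end tag.
        return DISALLOWED_LT.matcher(ampEscaped).replaceAll("&lt;");
    }

    public static void main(String[] args) {
        // prints: <b>bold</b> &lt;script>alert(1)&lt;/script> a &amp; b
        System.out.println(escape("<b>bold</b> <script>alert(1)</script> a & b"));
    }
}
```

Variants with spaces such as < i > and </ i> pass through unchanged, while something like <br> becomes &lt;br>.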
Andreas
  • In case "there is a <b> but there is no </b>". –  Jan 28 '16 at 02:40
  • @saka1029 You could just generate `
    user text here
    `. That way, any unended **bold** or *italic* tags started by user text will be ended. Or you could use something other than regex. Or you could do some regex `find()` after the `replaceAll()` calls to check if there are unended **bold** or *italic* tags, then end only when needed.
    – Andreas Jan 28 '16 at 07:24
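
The last suggestion in that comment (a regex `find()` pass after the `replaceAll()` calls that ends tags only when needed) could look roughly like the sketch below. It assumes the input has already been escaped as in the answer, so the only remaining tags are <b>, <i>, and their end tags. The class name `UnclosedTagCloser` and the method name `closeUnclosed` are hypothetical. It appends missing end tags rather than escaping the unpaired start tag, and it leaves stray end tags alone.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, sketched under the assumptions stated above.
final class UnclosedTagCloser {
    // Group 1 is "/" for an end tag (empty for a start tag); group 2 is the tag name.
    private static final Pattern ALLOWED_TAG =
            Pattern.compile("<(/?)\\s*([bi])\\s*>");

    static String closeUnclosed(String escaped) {
        Deque<String> open = new ArrayDeque<>();   // most recently opened tag is at the head
        Matcher m = ALLOWED_TAG.matcher(escaped);
        while (m.find()) {
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);                   // start tag: remember it
            } else {
                // End tag: drop the most recently opened tag of that name, if there is one.
                open.removeFirstOccurrence(name);
            }
        }
        // Append end tags for whatever is still open, innermost first.
        StringBuilder out = new StringBuilder(escaped);
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, `closeUnclosed("<b><i>x")` returns `<b><i>x</i></b>`, while already balanced input such as `<b>x</b>` is returned unchanged.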