I'm thinking of adding a rich text editor to allow a non-programmer to change the appearance of text. However, one issue is that incorrect markup can distort the layout of a rendered page. What's a good lightweight way to sanitize HTML?
5 Answers
You will have to decide between good and lightweight. The recommended choice is 'HTMLPurifier', because it provides no-fuss secure defaults. As a faster alternative, 'htmLawed' is often advised.
See also this quite objective overview from the HTMLPurifier author: http://htmlpurifier.org/comparison
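For reference, a minimal sketch of HTMLPurifier with its default configuration. It assumes the library was installed through Composer (package `ezyang/htmlpurifier`) and that `vendor/autoload.php` is reachable from the script; the `$dirty` string is made up for illustration.

```php
<?php
// Assumption: installed via Composer, autoloader in ./vendor.
require_once 'vendor/autoload.php';

// The default configuration is meant to be safe without further tuning.
$config   = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);

// Hypothetical editor output with an event handler and a script tag.
$dirty = '<p onclick="evil()">Hello <b>world</b><script>alert(1)</script></p>';
$clean = $purifier->purify($dirty);

echo $clean, "\n";
```

The script element and the `onclick` attribute should not survive purification, while the paragraph and bold markup should.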

-
Thanks. I got HTMLPurifier working. The documentation isn't easy to read but I managed to get it to filter some rich text to a minimum and adapted the charset to iso to avoid accents getting removed. – James P. Apr 05 '11 at 13:46
-
To anyone considering htmLawed: first look at the code - you'll cry. There's no alternative to HTMLPurifier at this moment. Just to save you time – ymakux Dec 16 '16 at 13:47
-
What's wrong with the code? Just because you cannot understand it does not make it bad. htmLawed is so much faster, smaller and more efficient than HTMLPurifier that it should not be dismissed just because it is not written the way you like. – user594694 Feb 23 '17 at 21:00
-
The htmLawed author seems to have no sense of security. The website and forum are not using HTTPS, and the website [urges you to disable Composer's secure-http](http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/composer_usage.htm), as he cannot be arsed to move to HTTPS or a Git repository. I wouldn't trust anything security-related to that person. – DennisK Oct 12 '18 at 07:30
I really like HTML Purifier, which allows you to specify which tags and attributes are allowed in your HTML code -- and generates valid HTML.
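A rough sketch of restricting the allowed tags and attributes through the `HTML.Allowed` directive. The allowlist and the input string here are only examples, and the Composer autoload path is an assumption about how the library is installed.

```php
<?php
// Assumption: installed via Composer (ezyang/htmlpurifier).
require_once 'vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Example allowlist: a few inline/block tags, and only href on links.
$config->set('HTML.Allowed', 'p,br,b,i,a[href]');
$purifier = new HTMLPurifier($config);

// Hypothetical rich-text input with markup outside the allowlist.
$dirty = '<h1>Title</h1><p style="position:absolute">Some <b>bold</b> text and a '
       . '<a href="http://example.com" onclick="x()">link</a></p>';
$clean = $purifier->purify($dirty);

echo $clean, "\n";
```

Anything not named in the allowlist (the `h1` tag, the `style` and `onclick` attributes) should be stripped, which is what keeps a non-programmer's markup from distorting the page layout.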

Use BB codes (or like here on SO), otherwise chances are very slim. Example function...
function parse($string){
    $pattern = array(
        "/\[url\](.*?)\[\/url\]/",
        "/\[img\](.*?)\[\/img\]/",
        "/\[img\=(.*?)\](.*?)\[\/img\]/",
        "/\[url\=(.*?)\](.*?)\[\/url\]/",
        "/\[red\](.*?)\[\/red\]/",
        "/\[b\](.*?)\[\/b\]/",
        "/\[h(.*?)\](.*?)\[\/h(.*?)\]/",
        "/\[p\](.*?)\[\/p\]/",
        "/\[php\](.*?)\[\/php\]/is"
    );
    $replacement = array(
        '<a href="\\1">\\1</a>',
        '<img alt="" src="\\1"/>',
        '<img alt="" class="\\1" src="\\2"/>',
        '<a rel="nofollow" target="_blank" href="\\1">\\2</a>',
        '<span style="color:#ff0000;">\\1</span>',
        '<span style="font-weight:bold;">\\1</span>',
        '<h\\1>\\2</h\\3>',
        '<p>\\1</p>',
        '<pre><code class="php">\\1</code></pre>'
    );
    $string = preg_replace($pattern, $replacement, $string);
    $string = nl2br($string);
    return $string;
}
...
echo parse("[h2]Lorem Ipsum[/h2][p]Dolor sit amet[/p]");
Result...
<h2>Lorem Ipsum</h2><p>Dolor sit amet</p>
Or just use HTML Purifier :)

-
Good suggestion. I'm wondering why an animated dragon appeared when upvoting you though :p . – James P. Apr 01 '11 at 11:30
-
For BBCode to be secure, you would have to run it through a purifier such as [HTMLPurifier](http://htmlpurifier.org/) anyway. **There's really no point.** Naive BBCode is wide open to exploits: consider what the above parser would produce for the input string `[img]http://picture.of.a/pony.png" onload="execute(); arbitrary(); javascript();[/img]`. – Lauren Apr 01 '11 at 11:59
-
Yup, definitely not for public usage. I ignored the security aspect completely; I thought it was for private usage. @James P., use HTMLPurifier ;) – Dejan Marjanović Apr 01 '11 at 12:06
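To make the injection from the comments concrete, here is a small sketch that uses only the `[img]` rule from the parser above. The function names `parse_img_naive` and `parse_img_escaped` are made up for this illustration, and the input is a shortened version of the one in the comment. Escaping with `htmlspecialchars` first is one possible hardening step, not a complete fix.

```php
<?php
// Only the [img] rule from the parser above, to keep the demo short.
function parse_img_naive(string $s): string {
    return preg_replace('/\[img\](.*?)\[\/img\]/', '<img alt="" src="\\1"/>', $s);
}

// Hypothetical hardening: escape HTML special characters before expanding
// tags, so a double quote in the input cannot close the src attribute.
function parse_img_escaped(string $s): string {
    $s = htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    return parse_img_naive($s);
}

$input = '[img]http://picture.of.a/pony.png" onload="execute();[/img]';

echo parse_img_naive($input), "\n";
// <img alt="" src="http://picture.of.a/pony.png" onload="execute();"/>

echo parse_img_escaped($input), "\n";
// <img alt="" src="http://picture.of.a/pony.png&quot; onload=&quot;execute();"/>
```

Even the escaped variant does not validate the URL itself (a `javascript:` scheme would still pass through), which is why running the result through a purifier is still the safer route.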
Both HTML Purifier and htmLawed are good. htmLawed has the advantage of a much smaller footprint and high configurability. Besides doing the standard work of balancing tags, filtering specific HTML tags or their attributes or attribute content (through white or black lists), etc., it also allows the use of custom functions.
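A minimal sketch of an htmLawed call, assuming `htmLawed.php` sits next to the script. The `safe` and `elements` settings follow my reading of the htmLawed documentation, and the element list is only an example, so double-check both against the version you install.

```php
<?php
// Assumption: htmLawed.php was downloaded into the same directory.
require_once 'htmLawed.php';

$config = array(
    'safe'     => 1,                 // turn on the stricter, security-oriented defaults
    'elements' => 'a, b, i, p, br',  // example allowlist of tags
);

// Hypothetical editor output with an unclosed tag and a script element.
$dirty = '<p>Some <b>bold text<script>alert(1)</script></p>';
$clean = htmLawed($dirty, $config);

echo $clean, "\n";
```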

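Using the HTML Sanitizer API (an experimental browser API; this snippet follows an early draft, and the method names have changed since, so check current browser support) it's easy to do: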
// our input string to clean
const stringToClean = 'Some text <b><i>with</i></b> <blink>tags</blink>, including a rogue script <script>alert(1)</script> def.';
const result = new Sanitizer().sanitizeToString(stringToClean);
console.log(result);
// Logs: "Some text <b><i>with</i></b> <blink>tags</blink>, including a rogue script def."
