2

I know there has been a lot of discussion over the years about the best methods of filtering data with PHP, but I would like to take the whitelist approach in my current project.

I only want a user to be able to use the following HTML:

<b>bold</b>
<i>italics</i>
<u>underline</u>
<s>strikethrough</s>
<big>Big size</big>
<small>Small size</small>

Hyperlink <a href="http://www.site.com">website</a>

A Bulleted List:
<ul>
<li>One Item</li>
<li>Another Item</li>
</ul>

An Ordered List:
<ol>
<li> First Item</li>
<li> Second Item</li>
</ol>

<blockquote>Because it is indented</blockquote>

<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>

Can anyone show me the best-performing way of doing this in PHP? In the past I have only ever allowed all HTML minus certain tags.

JasonDavis
  • On a side note, you should use `<strong>` instead of `<b>` because `<b>` is not valid HTML. – Scott Dec 29 '09 at 17:06
  • 5
    `<b>` is valid HTML. It is not deprecated. It is not removed. (Although some people think it should be). Bold text is a form of presentation and is not the same as strong emphasis. You shouldn't simply replace any instance of `<b>` with `<strong>`, you should consider what semantics are correct for the specific situation and use whatever is best (which might be something that isn't `` combined with CSS `font-weight`). – Quentin Dec 29 '09 at 17:18

4 Answers

8

I believe the HTML Purifier Library will work nicely:

http://htmlpurifier.org/

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications. Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!
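
For the tag list in the question, a minimal sketch of a whitelist configuration might look like the following. It assumes HTML Purifier was installed with Composer (so the autoloader lives in `vendor/autoload.php`) and relies on the `HTML.Allowed` and `URI.AllowedSchemes` configuration directives; `purifyUserHtml()` is a hypothetical helper name, not part of the library.

```php
<?php
// Assumption: HTML Purifier was installed via Composer (ezyang/htmlpurifier).
// Adjust the path if the library is loaded some other way.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/**
 * Hypothetical helper: filter user HTML down to the question's whitelist.
 */
function purifyUserHtml(string $dirty): string
{
    $config = HTMLPurifier_Config::createDefault();

    // Only these elements survive; "a[href]" allows the href attribute on links
    // and nothing else (no onclick, no style, and so on).
    $config->set(
        'HTML.Allowed',
        'b,i,u,s,big,small,a[href],ul,ol,li,blockquote,h1,h2,h3'
    );

    // Assumption: only plain web links are wanted, so restrict URI schemes.
    $config->set('URI.AllowedSchemes', ['http' => true, 'https' => true]);

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}
```

Building the purifier object is the expensive part, so in a real application it is probably worth creating it once and reusing it for every submission rather than rebuilding it per call as this sketch does.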

gahooa
1

The simplest solution would be strip_tags(), which accepts a second argument containing allowable tags:

strip_tags($string, "<b><i><u><a><s><big><small><ul><li><ol><blockquote><h1><h2><h3>");
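
A short sketch of what this call does and does not do may help when weighing it. `strip_tags()` filters tag names only: attributes on allowed tags pass through untouched, and the text inside removed tags is kept. The input string below is made up for illustration.

```php
<?php
// The whitelist from the question, in the format strip_tags() expects.
$allowed = '<b><i><u><a><s><big><small><ul><li><ol><blockquote><h1><h2><h3>';

// Illustrative input: an allowed tag carrying an event handler,
// a disallowed <script>, and a disallowed <div>.
$input = '<b onmouseover="alert(1)">bold</b><script>alert(2)</script><div>text</div>';

$output = strip_tags($input, $allowed);

// The <script> and <div> tags are gone, but their text content remains,
// and the onmouseover attribute on <b> is still there:
// <b onmouseover="alert(1)">bold</b>alert(2)text
echo $output, PHP_EOL;
```

So the tag whitelist alone does not make the output safe to echo into a page; attributes would still need separate handling.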
Mark
  • 9
    It's no good. strip_tags is a simplistic approach to a difficult problem, which has always had many workarounds to get bad content in. Even if it were bug-free, the lack of attribute filtering leaves you no way to disallow harmful constructs like ``. – bobince Dec 29 '09 at 17:30
1

Another route is using strip_tags with the second argument.

http://php.net/manual/en/function.strip-tags.php

Galen
1

I would run the submitted code through tidy to normalize it first, and then use XPath or apply XSLT to select only the allowed elements. This way, nothing can leak. Do bear in mind, too, that in any given website situation you would probably have thousands if not hundreds of thousands of read requests for every write request [that uses tidy and XPath/XSLT], so on average the performance impact is negligible. If you are doing batch processing, on the other hand...

Edit: oh and: DON'T do this with regular expressions. It is mathematically impossible to do it correctly.
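
The normalize-then-whitelist idea can be sketched roughly as below. Note the substitution: instead of tidy plus XPath/XSLT, this uses `DOMDocument::loadHTML()` (which also repairs tag soup) and a recursive walk over the parsed tree. `sanitizeHtml()` and `sanitizeChildren()` are hypothetical names, and the rule that only `http(s)` links are kept is an assumption, not something stated in the thread.

```php
<?php
/**
 * Walk the children of $parent and enforce the whitelist in place.
 * $allowed maps a tag name to the list of attributes permitted on it.
 */
function sanitizeChildren(DOMNode $parent, array $allowed): void
{
    // Copy the live child list first, because the tree is edited while looping.
    foreach (iterator_to_array($parent->childNodes, false) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            continue; // plain text is escaped again when serialized
        }
        if (!($child instanceof DOMElement)) {
            $parent->removeChild($child); // comments, CDATA, processing instructions
            continue;
        }
        $tag = strtolower($child->nodeName);
        if ($tag === 'script' || $tag === 'style') {
            $parent->removeChild($child); // drop the element and its content
            continue;
        }
        sanitizeChildren($child, $allowed); // clean descendants first
        if (!array_key_exists($tag, $allowed)) {
            // Unknown tag: keep its already-cleaned children, drop the wrapper.
            while ($child->firstChild !== null) {
                $parent->insertBefore($child->firstChild, $child);
            }
            $parent->removeChild($child);
            continue;
        }
        foreach (iterator_to_array($child->attributes, false) as $attr) {
            $name = strtolower($attr->nodeName);
            $ok = in_array($name, $allowed[$tag], true);
            if ($ok && $name === 'href') {
                // Assumption: only absolute http(s) links are acceptable.
                $ok = preg_match('#^https?://#i', trim($attr->nodeValue)) === 1;
            }
            if (!$ok) {
                $child->removeAttributeNode($attr);
            }
        }
    }
}

/** Parse an HTML fragment, enforce the whitelist, and serialize it again. */
function sanitizeHtml(string $html, array $allowed): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence tag-soup warnings
    // The XML declaration is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    sanitizeChildren($body, $allowed);

    $out = '';
    foreach ($body->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

// The whitelist from the question; only <a> may carry an attribute.
$allowed = [
    'b' => [], 'i' => [], 'u' => [], 's' => [], 'big' => [], 'small' => [],
    'a' => ['href'], 'ul' => [], 'ol' => [], 'li' => [], 'blockquote' => [],
    'h1' => [], 'h2' => [], 'h3' => [],
];

$dirty = '<b onclick="evil()">bold</b><script>alert(1)</script>'
       . '<a href="javascript:alert(2)">x</a><div><i>kept</i></div>';
echo sanitizeHtml($dirty, $allowed), PHP_EOL;
```

Because the filter works on a parsed tree rather than on the raw string, malformed or oddly nested markup is normalized before any whitelist decision is made. This is a sketch rather than an audited filter, so a maintained library remains the safer choice for production input.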

mst