I suggest that you use DOMDocument (with loadHTML) to load the HTML, remove every tag and attribute you don't want, and save the HTML back out (using saveXML or saveHTML). You can do that by recursively iterating over the children of the document's root and replacing unwanted tags with their inner contents. Since loadHTML parses markup in much the same way browsers do, this is a much safer approach than using regular expressions.
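To illustrate why parsing beats pattern matching, here is a minimal sketch (the sample markup and variable names are made up for illustration). It feeds loadHTML a sloppy tag with an upper-case name, a mixed-case attribute, and unquoted values, which a naive regex looking for onerror="..." would miss, and then lists what the parser actually saw. The libxml_use_internal_errors call is only there to keep parse warnings quiet.

```php
<?php
// Sloppy markup: upper-case tag, mixed-case attribute names, unquoted values.
$html = '<p>Hi <IMG SRC=x.png OnError=evil()></p>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The parser normalizes names, so the element and its attributes
// can be checked against a lower-case whitelist.
$img = $doc->getElementsByTagName('img')->item(0);
$names = array();
foreach ($img->attributes as $attribute) {
    $names[] = $attribute->name;
}

echo $img->nodeName, ': ', implode(', ', $names), "\n"; // img: src, onerror
```

Once the markup is a DOM tree, the event handler is just another attribute node that can be inspected and removed, regardless of how it was spelled in the input.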
EDIT: Here's a "purifying" function I made:
<?php
function purifyNode($node, $whitelist)
{
    // copy childNodes since we're going to iterate over it and modify the collection
    $children = array();
    foreach ($node->childNodes as $child) {
        $children[] = $child;
    }

    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            // purify the subtree first, so anything hoisted below is already clean
            purifyNode($child, $whitelist);

            $tagName = strtolower($child->nodeName);
            if (!isset($whitelist[$tagName])) {
                // tag not allowed: move its children into the parent, then drop the tag
                while ($child->childNodes->length > 0) {
                    $node->insertBefore($child->firstChild, $child);
                }
                $node->removeChild($child);
            } else {
                $attributes = $whitelist[$tagName];
                // copy attributes since we're going to iterate over it and modify the collection
                $childAttributes = array();
                foreach ($child->attributes as $attribute) {
                    $childAttributes[] = $attribute;
                }

                // keep an attribute only if it is whitelisted and its value matches the regex
                foreach ($childAttributes as $attribute) {
                    if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value)) {
                        $child->removeAttribute($attribute->name);
                    }
                }
            }
        }
    }
}

function purifyHTML($html, $whitelist)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // make sure <html> doesn't have any attributes
    while ($doc->documentElement->hasAttributes()) {
        $doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));
    }

    purifyNode($doc->documentElement, $whitelist);

    // keep only what sits between <html> and </html> in the serialized document
    $html = $doc->saveHTML();
    $fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
    return substr($html, $fragmentStart, -8); // 8 is the length of </html> plus the trailing newline
}
?>
You would call purifyHTML with an unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that aren't in the whitelist are stripped, with their contents inlined in the parent tag. For a whitelisted tag, attributes that aren't listed are removed, and so are listed attributes whose value doesn't match the regex.
Here's an example:
<?php
$html = <<<HTML
<p>This is a paragraph.</p>
<p onclick="alert('xss')">This is an evil paragraph.</p>
<p><a href="javascript:evil()">Evil link</a></p>
<p><script>evil()</script></p>
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
HTML;
// whitelist format: tag => array(attribute => regex)
$whitelist = array(
    'b' => array(),
    'i' => array(),
    'u' => array(),
    'p' => array(),
    'img' => array('src' => '@\Ahttp://.+\Z@', 'alt' => '@.*@'),
    'a' => array('href' => '@\Ahttp://.+\Z@')
);
$purified = purifyHTML($html, $whitelist);
echo $purified;
?>
The result is:
<p>This is a paragraph.</p>
<p>This is an evil paragraph.</p>
<p><a>Evil link</a></p>
<p>evil()</p>
<p>This is an evil image: <img></p>
<p>This is nice <b>bold text</b>.</p>
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>
Obviously, you don't want to allow any on* attribute, and I would advise against style because of weird proprietary properties like behavior. Make sure all URL attributes are validated with a decent regex that matches the full string (\Aregex\Z).
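As a quick sketch of why the full-string anchors matter, the snippet below compares the anchored pattern from the whitelist example with an unanchored variant. The helper attributeValueAllowed and the sample URLs are hypothetical, introduced only for this illustration.

```php
<?php
// Hypothetical helper: true when $value matches $regex.
function attributeValueAllowed($regex, $value)
{
    return preg_match($regex, $value) === 1;
}

$anchored   = '@\Ahttp://.+\Z@'; // same pattern as in the whitelist example
$unanchored = '@http://.+@';     // looks similar, but may match anywhere in the string

$good = 'http://example.org/image.png';
$evil = "javascript:evil('http://example.org/')";

var_dump(attributeValueAllowed($anchored, $good));   // bool(true)
var_dump(attributeValueAllowed($anchored, $evil));   // bool(false)
var_dump(attributeValueAllowed($unanchored, $evil)); // bool(true): the payload slips through
```

Without \A, a javascript: URL only has to contain something that looks like an http:// address somewhere in its body to pass the check.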