As most (all?) PHP libraries that do HTML sanitization, such as HTML Purifier, are heavily dependent on regex, I thought that trying to write an HTML sanitizer that uses DOMDocument and related classes would be a worthwhile experiment. While I'm at a very early stage with this, the project so far shows some promise.
My idea revolves around a class that uses DOMDocument to traverse all nodes in the supplied markup, compare them to a whitelist, and remove anything not on the whitelist. (The first implementation is very basic, only removing nodes based on their type, but I hope to get more sophisticated in the future and analyse each node's attributes, whether links point to a different domain, etc.)
My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeList have shown that you need to be very careful about the order in which you remove things; otherwise you might leave items behind or trigger exceptions.
If anyone has experience with manipulating a DOM tree in PHP I'd appreciate any feedback you may have on the topic.
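The ordering pitfall mentioned above can be reproduced in a few lines. This is only a minimal sketch: `buildSample()` and the sample markup are made up for illustration. It relies on `childNodes` being a live list, so a forward index loop skips nodes, while a backward loop or a snapshot taken with `iterator_to_array()` does not.

```php
<?php
// Hypothetical helper (not part of the sanitizer): a <div> with four <b> children.
function buildSample(): array
{
    $doc = new DOMDocument();
    $doc->loadHTML('<div><b>1</b><b>2</b><b>3</b><b>4</b></div>');
    // Return the document too, so it stays referenced alongside the node.
    return [$doc, $doc->getElementsByTagName('div')->item(0)];
}

// 1) Forward loop over the live list: each removal shifts the remaining
//    items down by one, so every second child is skipped.
[$doc, $div] = buildSample();
$children = $div->childNodes;
for ($i = 0; $i < $children->length; $i++) {
    $div->removeChild($children->item($i));
}
$leftForward = $div->childNodes->length; // 2 of the 4 children survive

// 2) Backward loop: removing the last item never changes the index
//    of the items that are still to be visited.
[$doc, $div] = buildSample();
$children = $div->childNodes;
for ($i = $children->length - 1; $i >= 0; $i--) {
    $div->removeChild($children->item($i));
}
$leftBackward = $div->childNodes->length; // 0

// 3) Snapshot: copy the live list into a plain array first, then remove.
[$doc, $div] = buildSample();
foreach (iterator_to_array($div->childNodes) as $child) {
    $div->removeChild($child);
}
$leftSnapshot = $div->childNodes->length; // 0

echo "forward: $leftForward, backward: $leftBackward, snapshot: $leftSnapshot\n";
```

Either of the last two variants should be safe for removal during traversal; the snapshot costs an extra array but lets you use a plain `foreach`.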
EDIT: I've built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether the found elements are on the whitelist. If they aren't, they are removed.
The problem I was hitting was that if you delete a node, the indexes of all subsequent nodes in the DOMNodeList change. Simply working from the last child to the first avoids this problem. It's still a very basic approach, but I think it shows promise. It certainly works a lot faster than HTML Purifier, though admittedly HTML Purifier does a lot more.
    /**
     * Recursively remove elements from the DOM that aren't whitelisted.
     * Nodes are matched by nodeName, so text nodes only survive if
     * '#text' is on the whitelist.
     * @param DOMNode $elem
     * @return array List of elements removed from the DOM
     * @throws Exception If removal of a node fails
     */
    private function cleanNodes (DOMNode $elem)
    {
        $removed = array ();
        if (in_array ($elem -> nodeName, $this -> whiteList, true))
        {
            if ($elem -> hasChildNodes ())
            {
                /*
                 * Iterate over the element's children. The reason we go backwards is because
                 * going forwards will cause indexes to change when elements get removed
                 */
                $children = $elem -> childNodes;
                $index = $children -> length;
                while (--$index >= 0)
                {
                    $removed = array_merge ($removed, $this -> cleanNodes ($children -> item ($index)));
                }
            }
        }
        else
        {
            // The element is not on the whitelist, so remove it (along with its whole subtree)
            if ($elem -> parentNode -> removeChild ($elem))
            {
                $removed [] = $elem;
            }
            else
            {
                throw new Exception ('Failed to remove node from DOM');
            }
        }
        return ($removed);
    }
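
To show how the method above might be driven, here is a rough, self-contained sketch. The `HtmlCleaner` class name, its constructor, the public `clean()` wrapper, the whitelist contents and the sample markup are all assumptions made for illustration; only the logic of `cleanNodes()` is taken from the method above. Cleaning starts at the `<body>` element so that the root passed in is itself whitelisted and always has a parent.

```php
<?php
// Hypothetical wrapper class; only the logic of cleanNodes() comes from the post.
class HtmlCleaner
{
    private $whiteList;

    public function __construct(array $whiteList)
    {
        $this->whiteList = $whiteList;
    }

    // Assumed public entry point around the private method.
    public function clean(DOMNode $root): array
    {
        return $this->cleanNodes($root);
    }

    private function cleanNodes(DOMNode $elem): array
    {
        $removed = [];
        if (in_array($elem->nodeName, $this->whiteList, true)) {
            // Walk the children backwards so removals don't shift pending indexes.
            $children = $elem->childNodes;
            for ($i = $children->length - 1; $i >= 0; $i--) {
                $removed = array_merge($removed, $this->cleanNodes($children->item($i)));
            }
        } else {
            // Not whitelisted: drop the node together with its subtree.
            $removed[] = $elem->parentNode->removeChild($elem);
        }
        return $removed;
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Hello <b>world</b><script>alert(1)</script></p>');
$body = $doc->getElementsByTagName('body')->item(0);

// '#text' has to be listed, otherwise every text node would be stripped too.
$cleaner = new HtmlCleaner(['body', 'p', 'b', '#text']);
$removed = $cleaner->clean($body);
$html    = $doc->saveHTML($body);

echo count($removed), " node(s) removed\n";
echo $html, "\n";
```

Note that a removed node takes its children with it, so the text inside the `<script>` element never shows up in the returned list on its own.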