As most (all?) PHP libraries that do HTML sanitization, such as HTML Purifier, are heavily dependent on regexes, I thought that writing an HTML sanitizer built on DOMDocument and its related classes would be a worthwhile experiment. I'm at a very early stage with this, but the project so far shows some promise.

My idea revolves around a class that uses DOMDocument to traverse every node in the supplied markup, compare each one to a whitelist, and remove anything not on the list. The first implementation is very basic, removing nodes based only on their type, but I hope to get more sophisticated in the future and analyse each node's attributes, whether links point to a different domain, and so on.

My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeList have shown that you need to be very careful about the order in which you remove things, otherwise you can leave items behind or trigger exceptions.

If anyone has experience with manipulating a DOM tree in PHP I'd appreciate any feedback you may have on the topic.

EDIT: I've built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether each element it finds is on the whitelist; if it isn't, it's removed.

The problem I was hitting was that deleting a node changes the indexes of all subsequent nodes in the DOMNodeList; simply working from bottom to top avoids this. It's still a very basic approach, but I think it shows promise. It certainly runs a lot faster than HTMLPurifier, though admittedly Purifier does a lot more.

/**
 * Recursively remove elements from the DOM that aren't whitelisted
 * @param DOMNode $elem
 * @return array List of elements removed from the DOM
 * @throws Exception If removal of a node fails, an exception is thrown
 */
private function cleanNodes(DOMNode $elem)
{
    $removed = array();
    if (in_array($elem->nodeName, $this->whiteList))
    {
        if ($elem->hasChildNodes())
        {
            /*
             * Iterate over the element's children. We go backwards because
             * going forwards would cause the indexes to shift as elements
             * get removed
             */
            $children = $elem->childNodes;
            $index    = $children->length;
            while (--$index >= 0)
            {
                $removed = array_merge($removed, $this->cleanNodes($children->item($index)));
            }
        }
    }
    else
    {
        // The element is not on the whitelist, so remove it.
        // removeChild() returns the removed node on success
        if ($elem->parentNode->removeChild($elem))
        {
            $removed[] = $elem;
        }
        else
        {
            throw new Exception('Failed to remove node from DOM');
        }
    }
    return $removed;
}
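For completeness, here's a minimal sketch of how the method might be driven (the HtmlCleaner class name, the whitelist contents, and the public clean() wrapper are illustrative, not a final design):

class HtmlCleaner
{
    // Note that '#text' has to be whitelisted, or all text content gets stripped too
    private $whiteList = array('#document', 'html', 'body', 'div', 'p', 'a', 'b', 'i', '#text');

    public function clean($html)
    {
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress libxml warnings about sloppy markup
        $this->cleanNodes($doc->documentElement);
        return $doc->saveHTML();
    }

    // cleanNodes() as defined above
}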
GordonM
  • Don't do that. Don't reinvent the wheel. Reuse the software already present. – dynamic Jun 15 '11 at 10:12
  • The software already present is stuff like HTMLPurifier, which is slow as hell and based on regexes. I'm doing this partly because I want to see if there's a better method, and partly because I want to use the exercise as a learning experience. – GordonM Jun 15 '11 at 10:26

1 Answer

For a start, you can have a look at this custom RecursiveDOMIterator:

Code:

class RecursiveDOMIterator implements RecursiveIterator
{
    /**
     * Current Position in DOMNodeList
     * @var Integer
     */
    protected $_position;

    /**
     * The DOMNodeList with all children to iterate over
     * @var DOMNodeList
     */
    protected $_nodeList;

    /**
     * @param DOMNode $domNode
     * @return void
     */
    public function __construct(DOMNode $domNode)
    {
        $this->_position = 0;
        $this->_nodeList = $domNode->childNodes;
    }

    /**
     * Returns the current DOMNode
     * @return DOMNode
     */
    public function current()
    {
        return $this->_nodeList->item($this->_position);
    }

    /**
     * Returns an iterator for the current iterator entry
     * @return RecursiveDOMIterator
     */
    public function getChildren()
    {
        return new self($this->current());
    }

    /**
     * Returns if an iterator can be created for the current entry.
     * @return Boolean
     */
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    /**
     * Returns the current position
     * @return Integer
     */
    public function key()
    {
        return $this->_position;
    }

    /**
     * Moves the current position to the next element.
     * @return void
     */
    public function next()
    {
        $this->_position++;
    }

    /**
     * Rewind the Iterator to the first element
     * @return void
     */
    public function rewind()
    {
        $this->_position = 0;
    }

    /**
     * Checks if current position is valid
     * @return Boolean
     */
    public function valid()
    {
        return $this->_position < $this->_nodeList->length;
    }
}

You can use that in combination with a RecursiveIteratorIterator. Usage examples are on the page.
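For instance, a minimal usage sketch (SELF_FIRST tells the RecursiveIteratorIterator to visit each parent before descending into its children):

$doc = new DOMDocument();
@$doc->loadHTML('<div><p>Hello <b>world</b></p></div>');

$iterator = new RecursiveIteratorIterator(
    new RecursiveDOMIterator($doc),
    RecursiveIteratorIterator::SELF_FIRST
);

foreach ($iterator as $node) {
    // Skip text nodes and only report elements: html, body, div, p, b
    if ($node->nodeType === XML_ELEMENT_NODE) {
        echo $node->nodeName, "\n";
    }
}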

In general, though, it would be easier to use XPath to search for blacklisted nodes than to traverse the whole DOM tree yourself. Also keep in mind that DOM is already quite good at preventing XSS, because it automatically escapes XML entities in nodeValue.
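As a sketch of the XPath approach (the particular blacklist in the query is just an example):

$doc = new DOMDocument();
@$doc->loadHTML('<body><p>ok</p><script>alert(1);</script><iframe src="x"></iframe></body>');

$xpath = new DOMXPath($doc);
// A single query finds every blacklisted element, however deeply it is nested
$blacklisted = $xpath->query('//script | //iframe | //object | //embed');

foreach ($blacklisted as $node) {
    echo 'found: ', $node->nodeName, "\n";
}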

The other thing to be aware of is that any manipulation of the DOMDocument will immediately affect any DOMNodeList you obtained from an XPath query, which can lead to nodes being skipped while you manipulate them. See DOMNode replacement with PHP's DOM classes for an example.
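One way to sidestep that (a sketch, not the only option) is to copy the query result into a plain PHP array before touching the tree:

// Copy the node list into a static array first ...
$nodes = iterator_to_array($xpath->query('//script'));

// ... then mutate the tree without the list shifting underneath you
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}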

Gordon
  • Looks like an interesting library. Will have a look at it. – GordonM Jun 15 '11 at 12:09
  • Oh, and as far as blacklisting goes, I'd prefer a whitelist on the grounds that it's intrinsically fail-safe, whilst blacklists fail unsafe. Should new elements be added to the HTML spec (as happened recently with HTML5) that could potentially be exploited for XSS attacks, a blacklist would let them through until it was updated, whereas a whitelist removes anything it doesn't recognise, and you can add tags later once you consider them safe. – GordonM Jun 15 '11 at 12:17
  • Thanks. But it's really just an iterator; nothing fancy. All it does is let you foreach over a DOM tree. I just needed an iterator example to add to salathe's wiki. Note that libxml does not currently fully support HTML5, so DOM will choke on some markup: http://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-style-in-script-tag/4029412. So prepare for quirks if you want your project to support HTML5. – Gordon Jun 15 '11 at 12:24
  • Yeah, it looks like there are going to be some issues to work around, but my experiments so far show a lot of promise. If it goes well enough, maybe it could form the basis of an actual project on SourceForge or wherever :) – GordonM Jun 15 '11 at 21:32