handling large HTML files in PHP

Question

I am trying to write a script to process (clean up, reformat) an HTML file, using DOM. Here is my code for loading the file:

$dom = new DOMDocument();
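// $htmFName holds a file name, so loadHTMLFile() is used; loadHTML() expects the markup itself.
$dom->loadHTMLFile($htmFName, LIBXML_PARSEHUGE);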

And here is my code for traversing the document and inspecting/modifying the nodes:

class DOMTraverser
{
    private $node;
    public function __construct(DOMNode $node)
    {
        $this->node = $node;
    }

    public function traverse(GeneralCallBack $cb, $param) {
        $cb->callBefore($this->node, $param);
        foreach ($this->node->childNodes as $subnode) {
            if ($subnode->hasChildNodes()) {
                // $trav = new DOMTraverser($subnode);
                // $trav->traverse($cb, $param);
                $this->traverse($cb, $param);
            }
        }
        $cb->callAfter($this->node);
    }
}

...

$trav = new DOMTraverser($dom);
$callback = new StoryDocCallback();
$trav->traverse($callback, $storyParms);

The problem is reported in the foreach statement of the traverse function:

    Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 4096 bytes) in D:\D\src\inc\DOMTraverser.cl on line 17

My input file is large (2.6MB, with nearly 15,000 tags), but nowhere near the 134MB size mentioned in the error message.

How can I process this file without running out of memory? Would I be better off doing this in Java?

Side note: while the "Allowed memory size" of 134,217,728 bytes seems like a lot, it's actually rather small compared with the 6 GB of memory on my system. Maybe there's a configuration variable I could change?
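
The figure in the error message matches PHP's `memory_limit` setting (134217728 bytes is 128M). Below is a minimal sketch of inspecting and raising that limit for a single script. The 512M value is an arbitrary assumption chosen for illustration, and a higher limit only helps when the script's memory use is actually bounded.

```php
<?php
// Read the current limit; "128M" corresponds to the 134217728 bytes in the error.
$before = ini_get('memory_limit');

// Raise the limit for this script only. 512M is an arbitrary example value,
// not a recommendation; -1 would remove the limit entirely.
ini_set('memory_limit', '512M');

echo "memory_limit: {$before} -> ", ini_get('memory_limit'), PHP_EOL;

// Peak memory actually requested from the system so far, in bytes.
echo "peak usage so far: ", memory_get_peak_usage(true), " bytes", PHP_EOL;
```

The same setting can be changed globally through the `memory_limit` directive in php.ini.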

PHP 7.0.8

Possible duplicate of [PHP Memory Limit](http://stackoverflow.com/questions/3792058/php-memory-limit) — Jonathan, Feb 18 '17 at 06:00
Well, that at least changed the result: Now I get "out of memory". Yes, after I changed the memory limit to -1 (no limit), I discovered that. I need to rework my code a little. (A quick fix involving a new "DOMTraverser" for each subnode solved the endless loop problem, but I still need to straighten my code out.) Thanks for pointing this out. — A. P. Damien, Feb 18 '17 at 06:09
The other approach for really large XML files is using XMLReader/XMLWriter. You would read the XML file using an XMLReader instance and create a modified copy using the XMLWriter instance. This allows you to keep the memory consumption down. — ThW, Feb 18 '17 at 13:42
It turns out memory consumption isn't the problem. Paul Crovella identified my error: I was recursively examining the _same_ node instead of the subnodes. I was tempted by XMLReader/XMLWriter, but I don't see any way to load HTML (and especially the _broken_ HTML that MS Word produces) into anything except DOM. I can go from DOM to SimpleXML, but SimpleXML doesn't seem to have any methods for modifying the XML tree. — A. P. Damien, Feb 18 '17 at 19:32
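
The fix described in the comments (recursing into each subnode rather than into the same node) might look like the sketch below. The `GeneralCallBack` interface and the `TagLogger` class are hypothetical stand-ins, since the asker's callback classes are not shown, and the inline HTML string is only a small test input.

```php
<?php
// Hypothetical stand-in for the asker's GeneralCallBack, whose definition is not shown.
interface GeneralCallBack
{
    public function callBefore(DOMNode $node, $param);
    public function callAfter(DOMNode $node);
}

class DOMTraverser
{
    private $node;

    public function __construct(DOMNode $node)
    {
        $this->node = $node;
    }

    public function traverse(GeneralCallBack $cb, $param)
    {
        $cb->callBefore($this->node, $param);
        // Snapshot the live child list so edits made by a callback
        // do not disturb the iteration.
        foreach (iterator_to_array($this->node->childNodes) as $subnode) {
            if ($subnode->hasChildNodes()) {
                // Descend into the child, not back into $this->node.
                (new DOMTraverser($subnode))->traverse($cb, $param);
            }
        }
        $cb->callAfter($this->node);
    }
}

// Hypothetical example callback: records node names in visit order.
class TagLogger implements GeneralCallBack
{
    public $visited = [];

    public function callBefore(DOMNode $node, $param)
    {
        $this->visited[] = $node->nodeName;
    }

    public function callAfter(DOMNode $node)
    {
        // Nothing to do after a node in this example.
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><div><p>one</p><p>two</p></div></body></html>');

$logger = new TagLogger();
(new DOMTraverser($dom))->traverse($logger, null);
echo implode(' ', $logger->visited), PHP_EOL;
```

As in the original code, only nodes that have children are visited below the root. Because each level of nesting now recurses into a different node, the recursion depth is bounded by the depth of the document tree.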
