
I'm building a command-line PHP scraping app that uses XPath to analyze the HTML. The problem is that every time a new DOMXPath instance is created in a loop, memory usage grows by roughly the size of the XML being loaded. The script runs and runs, slowly building up memory usage until it hits the limit and quits.

I've tried forcing garbage collection with gc_collect_cycles(), and PHP still isn't getting back memory from old XPath requests. Indeed, the definition of the DOMXPath class doesn't even seem to include a destructor.

So my question is ... is there any way to force garbage clean up on DOMXPath after I've already extracted the necessary data? Using unset on the class instance predictably does nothing.

The code is nothing special, just standard Xpath stuff:

//Loaded outside of loop
$this->dom = new DOMDocument(); 

//Inside Loop
$this->dom->loadHTML($output);  
$xpath = new DOMXPath($this->dom);
$nodes = $xpath->query("//span[@class='ckass']");

//unset($this->dom) and unset($xpath) doesn't seem to have any effect

As you can see above, I've kept the instantiation of the DOMDocument outside of the loop, although that doesn't seem to improve memory usage. I've even tried taking the $xpath instance out of the loop and passing the DOM to DOMXPath directly through its constructor (__construct); the memory growth is the same.

hakre
Corelloman
  • Are you destroying the nodes as well via `unset`? Also, if you're just scraping and not modifying the DOMs in question, I would use SimpleXML instead. It's a bit more lightweight and also supports XPath. – prodigitalson Nov 18 '11 at 20:38
  • Why `$this->dom`? You need to add the DOMDocument to a class member? – hakre Nov 18 '11 at 20:44
  • I'll check out SimpleXML - thanks! I use $this->dom since I declared $dom outside of the iterative function so it's not getting created with every iteration. – Corelloman Nov 18 '11 at 22:51

2 Answers

4

After this answer sat here for years without a conclusion, finally an update! I recently ran into a similar problem, and it turns out that DOMXPath just leaks the memory and you can't control it. I have not yet checked whether this has been reported on bugs.php.net (that could be useful to edit in later).

The "working" solutions I have found to the problem are just workarounds. The basic idea was to replace the DOMNodeList Traversable returned by DOMXPath::query() with a different one containing the same nodes.

The best-fitting workaround is DOMXPathElementsIterator, which lets you run the concrete XPath expression from your question without the memory leak:

$nodes = new DOMXPathElementsIterator($this->dom, "//span[@class='ckass']");

foreach ($nodes as $span) {
   ...
}

This class is now part of the development version of Iterator-Garden and $nodes is an iterator over all the <span> DOMElements.

The downside of this workaround is that the xpath result is limited to a SimpleXMLElement::xpath() result (this differs from DOMXPath::query()) because it's used internally to prevent the memory leak.
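As a rough illustration of the SimpleXML route mentioned above (not the actual Iterator-Garden code), a query can go through SimpleXMLElement::xpath() and the matches can be converted back to DOM elements. The function name `queryElements()` is made up for this sketch, it assumes an expression that selects elements, and whether it avoids the leak on a given PHP version is something to measure rather than take for granted:

```php
<?php
// Hypothetical helper, not part of Iterator-Garden: runs the expression
// through SimpleXML instead of DOMXPath and returns DOMElement objects.
function queryElements(DOMDocument $dom, string $expression): array
{
    $root = simplexml_import_dom($dom);
    if (!$root) {
        return []; // document could not be imported
    }

    $matches = $root->xpath($expression);
    if (!$matches) {
        return []; // no matches or invalid expression
    }

    // Convert each SimpleXMLElement back into its DOM node.
    return array_map('dom_import_simplexml', $matches);
}

$dom = new DOMDocument();
$dom->loadHTML('<p><span class="ckass">one</span><span>two</span></p>');

foreach (queryElements($dom, "//span[@class='ckass']") as $span) {
    echo $span->textContent, "\n";
}
```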

Another alternative is to use DOMNodeListIterator over a DOMNodeList like the one returned by DOMDocument::getElementsByTagName(). However, these iterations are slow.

Hope this is of some use even though the question is really old. It helped me in a similar situation.


Calling the garbage collector's cycle collection only makes sense if the objects are no longer referenced (used).

For example, creating a new DOMXPath object for the same DOMDocument over and over again (keep in mind it's connected to the DOMDocument, which still exists) sounds like your memory "leak". You just use more and more memory.

Instead you can just re-use the existing DOMXPath object as you re-use the DOMDocument object all the time. Give it a try:

//Loaded outside of loop
$this->dom = new DOMDocument(); 
$xpath = new DOMXPath($this->dom);

//Inside Loop
$this->dom->loadHTML($output);  
$nodes = $xpath->query("//span[@class='ckass']");
hakre
  • Ahh thanks! I assumed I had to load a new DOMXPath every time I wanted to load new content, that was my mistake - thanks a ton!! – Corelloman Nov 18 '11 at 22:09
  • Edit: now that I've tried this, the $xpath variable doesn't seem to be taking in the content unless I redeclare "$xpath = new DOMXPath($this->dom);" after loading the content into $this->dom. :( – Corelloman Nov 18 '11 at 22:50
  • Okay, the old DOMDocument related objects stay in memory after loadHTML. Don't know your class design, but probably you should remove the DOM from the class member before loading. Delete the DOM, the XPath and results. Then create a new DOM and XPath inside the loop each time. It's a bit late, had overlooked the `loadHTML` statement. – hakre Nov 18 '11 at 23:36
  • Haha no worries, I appreciate the input nonetheless, it's something I should have tried. That's just where I don't get it - I've searched everywhere and don't see anyway to manually get rid of the DOM/Xpath classes. Unsetting the variables doesn't release the memory, is there another way to do this that I'm just missing? – Corelloman Nov 18 '11 at 23:56
  • You need to keep in mind that the XPath object internally shares data with DOMDocument. So if you create one object from a DOMDocument, you need to destroy both to clean up the memory. – hakre Nov 19 '11 at 00:08
  • But how do you destroy these objects? unset doesn't work, and I can't find any other methods to "destroy a class", since it's supposed to be automatic in PHP? – Corelloman Nov 19 '11 at 12:17
  • Let's say you did an xpath query. First unset the DOMNodeList it returned, then unset the DOMXPath, then unset the DOMDocument. Then call the garbage collector cycle. This should do it. And yes, `unset` is the way to destroy in your case, there is no other in PHP. In case you used elements from the DOMDocument somewhere else, unset them as well before unsetting the DOMDocument. All nodes contain a reference to the DOMDocument. As long as you don't unset them, the DOMDocument will stay in memory. – hakre Nov 19 '11 at 12:43
  • DOMXPath does not refresh if DOMDocument is reused. You need to call a new DOMXPath each time. I have the same memory problem and none of your suggestions work for me using PHP 5.3.6. Any further thoughts? –  Dec 04 '11 at 22:08
  • DOMXpath has a DOMDocument associated. If you change the document, it will still refer to the old nodes (I assume you executed a xpath query already). Destroy that xpath object and the result of the query as well. Try to encapsulate things in a function and/or class to limit scope and re-usage problems. – hakre Dec 14 '11 at 10:17
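
The advice in the comments above (copy out the data you need, unset the query result, then the DOMXPath, then the DOMDocument, and keep it all inside a function so nothing outlives one iteration) might look like the sketch below. `extractSpans()` and the sample `$pages` array are made up for illustration, the expression is assumed to be valid, and as the thread shows, this may still not release everything on some PHP versions:

```php
<?php
// Hypothetical helper: everything DOM-related lives and dies inside this scope.
function extractSpans(string $html, string $expression): array
{
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);

    $texts = [];
    foreach ($nodes as $node) {
        // Copy plain strings so no node (and thus no document) stays referenced.
        $texts[] = trim($node->textContent);
    }

    // Release in the order suggested above: result, XPath, then document.
    unset($node, $nodes, $xpath, $dom);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $texts;
}

// Assumed sample input standing in for the scraped $output of each iteration.
$pages = [
    '<p><span class="ckass">one</span></p>',
    '<p><span class="ckass">two</span></p>',
];

foreach ($pages as $output) {
    $texts = extractSpans($output, "//span[@class='ckass']");
    gc_collect_cycles(); // only useful once nothing references the old objects
    echo implode(', ', $texts), ' | ', memory_get_usage(), " bytes\n";
}
```

Printing memory_get_usage() per iteration makes it easy to see whether usage actually plateaus on your setup.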
3

If you are using libxml_use_internal_errors(true); then that is the cause of the memory leak, because the error list keeps growing.

Use libxml_clear_errors(); or check this answer for details.
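
A minimal sketch of where that call could go, assuming a loop like the one in the question (the `$pages` array is just sample input for illustration):

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Assumed sample input; the second page is deliberately malformed.
$pages = [
    '<p><span class="ckass">one</span></p>',
    '<p><span class="ckass">two</p></span>',
];

foreach ($pages as $output) {
    $dom = new DOMDocument();
    $dom->loadHTML($output);

    // ... run the XPath queries here ...

    // Inspect libxml_get_errors() first if you need the messages,
    // then empty the buffer so it cannot grow across iterations.
    libxml_clear_errors();
    unset($dom);
}
```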

Boy