
We have a severe memory leak in one of our regularly run scripts that quickly wipes out the free memory on the server. Despite many hours of research and experiments, though, I've been unable to even make a dent in it.

Here is the code:

    echo '1:'.memory_get_usage()."\n";
    ini_set('memory_limit', '1G');
    echo '2:'.memory_get_usage()."\n";

    $oXML = new DOMDocument();
    echo '3:'.memory_get_usage()."\n";
    $oXML->load('feed.xml'); # 556 MB file
    echo '4:'.memory_get_usage()."\n";

    $xpath = new DOMXPath($oXML);
    echo '5:'.memory_get_usage()."\n";
    $oNodes = $xpath->query('//feed/item'); # 270,401 items
    echo '6:'.memory_get_usage()."\n";

    unset($xpath);
    echo '7:'.memory_get_usage()."\n";
    unset($oNodes);
    echo '8:'.memory_get_usage()."\n";
    unset($oXML);
    echo '9:'.memory_get_usage()."\n";

And here is the output:

    1:679016
    2:679320
    3:680128
    4:680568
    5:681304
    6:150852408
    7:150851840
    8:34169968
    9:34169448

As you can see, when we use xpath to load the nodes into an object, memory usage jumps from 681,304 to 150,852,408. I'm not terribly concerned about that.

My problem is that even after destroying the $oNodes object, we're still stuck at memory usage of 34,169,968.

But the real problem is that the memory usage that PHP shows is a tiny fraction of the total memory eaten by the script. Using free -m directly from the command line on the server, we go from 3,295 MB memory used to 5,226 MB -- and it never goes back down. We're losing 2 GB of memory every time this script runs, and I am at a complete loss as to why or how to fix it.
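For reference, memory_get_usage() with no argument only reports what the script has allocated through PHP's own memory manager; passing true reports what that manager has actually reserved from the system, and memory an extension grabs outside of it (as the DOM extension appears to do here) shows up in neither number. A small diagnostic sketch, not part of the script above:

    // Sketch: compare script-level usage with what PHP has reserved from the OS.
    // Memory allocated outside PHP's memory manager appears in neither figure.
    echo 'emalloc usage: ' . memory_get_usage() . "\n";        // allocated by the script
    echo 'real usage:    ' . memory_get_usage(true) . "\n";    // reserved from the system
    echo 'peak (real):   ' . memory_get_peak_usage(true) . "\n";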

I tried using SimpleXML instead, but the results were basically identical. I also studied these three threads but didn't find anything in them that helped:

  • XML xpath search and array looping with php, memory issue
  • DOMDocument / Xpath leaking memory during long command line process - any way to deconstruct this class
  • DOMDocument PHP Memory Leak

I'm hoping this is something easy that I'm just overlooking.

UPDATE 11/10: It does appear that memory is eventually freed up. I noticed that after a little more than 30 minutes, suddenly a big block came free again. Obviously, though, that hasn't been nearly fast enough recently to keep the server from running out of memory and locking up.

And for what it's worth, we're running PHP 5.3.15 with Apache 2.2.3 on Red Hat 5.11. We're working to update to the latest versions of all of those, so somewhere along that upgrade path, we might find this fixed. It would be great to do it before then, though.

Shane Pike
  • Try seeing how many references `$oNodes` has before you `unset` http://php.net/manual/en/features.gc.refcounting-basics.php – Machavity Nov 09 '15 at 16:15
  • assuming apache, remember that each child has its own separate memory pool, and if you're processing "huge" xml docs in parallel, none of the dom infrastructure is shared between the children. you could set the max_requests_per_child (whatever the setting is) low, so apache will nuke/restart the children more frequently, which would release some of the held memory. – Marc B Nov 09 '15 at 16:39
  • @Machavity: We don't have Xdebug installed. Is there a way to count the references without it? – Shane Pike Nov 09 '15 at 19:30
  • @Marc B: I'm not sure what you mean by "child" in this context. I'm sorry. In this case, this script is being run a single time. It isn't ever being hit by two different processes at the same time. – Shane Pike Nov 09 '15 at 19:33
  • so this is a command-line PHP script? whatever memory it's sucking up would be released when the script exits. the only way it could "leak" memory after exit is if it was (ab)using some system functionality and THAT had the leak. – Marc B Nov 09 '15 at 19:34
  • @Marc B: Right? That's exactly what I would expect. The script runs, ends, memory freed. That's not what's happening by any means, though, *and* memory_get_usage() isn't telling the whole story either. – Shane Pike Nov 09 '15 at 19:39
  • php doesn't run its garbage collector just because you unset a variable. GC runs are highly expensive, computationally, so PHP won't run it until it HAS to, e.g. memory is getting tight. so memory_get_usage() isn't really a valid test. – Marc B Nov 09 '15 at 19:40
  • Not natively. Xdebug exposes more of what's happening under the hood. In this same vein, [this blog](http://blog.ircmaxell.com/2014/12/what-about-garbage.html) talks extensively about how garbage is collected – Machavity Nov 09 '15 at 19:45
  • Are you confusing the file system cache with a memory leak? – Sean Bright Nov 18 '15 at 21:07
  • Also - I don't know if there is additional functionality in this script that requires you to use the DOM, but if not, I would recommend using [`XMLReader`](http://php.net/manual/en/book.xmlreader.php) instead. – Sean Bright Nov 18 '15 at 21:11
  • Finally - to be clear because you didn't explicitly answer Mark's question - this is a script that is executed from the command line, a la `php feed-parser.php`? – Sean Bright Nov 18 '15 at 21:14
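As a quick check of Marc B's point that unset() only drops references and doesn't by itself trigger a collection, the cycle collector can be forced with gc_collect_cycles(); a small diagnostic sketch against the variables from the question:

    // Diagnostic sketch: does forcing a GC run reclaim the remaining ~34 MB?
    unset($xpath);
    unset($oNodes);
    unset($oXML);

    echo 'before gc_collect_cycles(): ' . memory_get_usage() . "\n";
    $collected = gc_collect_cycles();   // force the cycle collector to run now
    echo 'collected cycles: ' . $collected . "\n";
    echo 'after gc_collect_cycles():  ' . memory_get_usage() . "\n";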

2 Answers


We recently experienced an issue just like yours. We needed to extract data from a 3 GB XML file and also noticed that server memory was reaching its limits. There are several ways you can decrease the memory usage:

  • instead of using XPath, which accounts for the great amount of memory usage, use (for example) file_get_contents and then search for the desired data with a regular expression
  • split the XML into smaller pieces. Basically it means reinventing the XML file, but you can control the maximum size of each piece (and thus the memory used)

You mentioned that after 30 minutes some memory was released. Reading a 500 MB XML file over 30 minutes is far too slow. The solution we used was splitting the 3 GB XML file into several pieces (approximately 200). Our script writes the required data (around 700k records) to our database in less than 5 minutes.
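A minimal sketch of the same incremental idea using XMLReader (the approach also suggested in the comments) to stream the feed item by item rather than splitting files on disk or building a full DOM; the feed.xml filename and item element name are taken from the question, and handleItem() is a hypothetical processing callback:

    // Sketch: stream <item> elements one at a time instead of loading the whole file.
    $reader = new XMLReader();
    $reader->open('feed.xml');

    // Advance to the first <item> element.
    while ($reader->read() && $reader->name !== 'item') {
        // keep reading
    }

    while ($reader->name === 'item') {
        $node = $reader->expand();   // build a small DOM fragment for just this <item>
        handleItem($node);           // hypothetical per-item processing
        $reader->next('item');       // jump to the next <item>, skipping its subtree
    }

    $reader->close();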

PAlphen

We just experienced a similar issue with PHPDocxPro (which uses DOMDocument) and submitted a patch to them that at least improves upon the problem. The memory usage reported by memory_get_usage() never increased, as though PHP was not aware of the allocation at all. The memory reported while watching execution via top or ps is what we were more concerned about.

    // ps reports X memory usage
    $foo = new DOMDocument();
    $foo->loadXML(getSomeXML());
    // ps reports X + Y memory usage
    $foo = new DOMDocument();
    $foo->loadXML(getSomeXML());
    // ps reports X + ~2Y memory usage
    $foo = new DOMDocument();
    $foo->loadXML(getSomeXML());
    // ps reports X + ~3Y memory usage

Adding an unset() before each subsequent call...

    // ps reports X memory usage
    $foo = new DOMDocument();
    $foo->loadXML(getSomeXML());
    // ps reports X + Y memory usage
    unset($foo);
    $foo = new DOMDocument();
    $foo->loadXML(getSomeXML());
    // ps reports X + ~Y memory usage
    unset($foo);
    $foo = new DOMDocument();
    $foo->loadXML(getSomeXML());
    // ps reports X + ~Y memory usage

I haven't dug into the extension code to understand what's going on, but my guess is that it allocates memory without going through PHP's allocator, so the allocation isn't counted as part of the heap that memory_get_usage() considers. Despite this, there does appear to be some reference counting that determines whether or not the memory can be freed. The unset($foo) before each subsequent call makes sure the extension can reuse some of those resources. Without it, memory usage increases every time the code is run.
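A minimal sketch of that pattern applied in a loop, where getSomeXML() and processDocument() are placeholders:

    // Hypothetical loop: getSomeXML() and processDocument() are placeholders.
    for ($i = 0; $i < 10; $i++) {
        $doc = new DOMDocument();
        $doc->loadXML(getSomeXML());
        processDocument($doc);
        // Dropping the last reference lets the extension release/reuse its
        // memory before the next load instead of accumulating copies.
        unset($doc);
    }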

thatthatisis