
I have a page which loads a 500 MB XML file and transforms it using an XSL template. The script works perfectly in my local environment, where I am using WAMP.

On the web server, however, I get the following warning:

Warning: DOMDocument::load() [domdocument.load]: (null)xmlSAX2Characters: out of memory in /home/mydomain/public_html/xslt/largeFile.xml, line: 2031052 in /home/mydomain/public_html/xslt/parser_large.php on line 6

My code is below; line 6 loads the XML file.

<?php
$xslDoc = new DOMDocument();
$xslDoc->load("template.xslt");

$xmlDoc = new DOMDocument();
$xmlDoc->load("largeFile.xml");

$proc = new XSLTProcessor();
$proc->importStylesheet($xslDoc);
echo $proc->transformToXML($xmlDoc);
?>

I have tried copying the php.ini file from the WAMP installation into the folder where the above code is located, but this has not helped. The memory limit in that php.ini file is memory_limit = 1000M

Any advice / experience on this would be greatly appreciated

Santosh Pillai

1 Answer


Here is the sad truth. There are two basic ways of working with XML: DOM-based, where the whole XML file is held in memory at once (with considerable overhead to make it fast to traverse), and SAX-based, where the file streams through memory and only a small portion of it is present at any given time.

With DOM, however, large memory consumption is pretty much normal, so running out of memory on a 500 MB file is not surprising.
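To make the streaming style concrete, here is a minimal sketch using PHP's expat-style `xml_parser_create()` functions, which form a SAX push parser. It only counts elements with a given name. The function name `countElementsSax` and the 8 KB chunk size are my own assumptions for illustration, not anything taken from your code. The point is that the file is fed to the parser piece by piece, so memory use should stay roughly flat regardless of file size.

```php
<?php
// Count elements named $elementName by streaming the file through a SAX parser.
// Hypothetical helper for illustration; no DOM tree is ever built.
function countElementsSax($path, $elementName)
{
    $count = 0;
    $parser = xml_parser_create();
    // By default this parser upper-cases element names; turn that off.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        // Start-tag handler: called once per opening tag.
        function ($parser, $name, $attrs) use (&$count, $elementName) {
            if ($name === $elementName) {
                $count++;
            }
        },
        // End-tag handler: nothing to do for a simple count.
        function ($parser, $name) {
        }
    );

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (!feof($handle)) {
        $chunk = fread($handle, 8192); // assumed chunk size
        if ($chunk === false) {
            break;
        }
        // The third argument tells the parser whether this is the last chunk.
        if (!xml_parse($parser, $chunk, feof($handle))) {
            $message = xml_error_string(xml_get_error_code($parser));
            fclose($handle);
            throw new RuntimeException("XML error: $message");
        }
    }
    fclose($handle);
    return $count;
}
```

For a real transformation the handlers would have to write output instead of counting, which in practice means re-implementing the logic of the XSLT templates by hand.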

Now, the XSLT language in general allows constructs that access any part of the whole file at any time, and it therefore requires the DOM style. Some programming languages have libraries that allow feeding SAX input into an XSLT processor, but this necessarily implies either restrictions on the XSLT language or memory consumption not much better than that of DOM. PHP does not have a way of making XSLT read SAX input, though.

That leaves us with alternatives to DOM; there is one, and it is called SimpleXML. SimpleXML is a little tricky to use if your document has namespaces. An ancient benchmark seems to indicate that it is somewhat faster than DOM on large files, and probably also less wasteful with memory.

And finally, I was in your shoes once, in another programming language. The solution was to split the document into small ones based on simple rules. Each small document contained a header copied from the whole document, one "detail" element, and a footer, so that it stayed valid against the big XML file's schema. Each small document was processed using XSLT (this assumes that processing one detail element does not look into any other detail element) and the outputs were combined. This works like a charm, but it is not implemented in seconds.
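In PHP, the split-and-transform idea might look roughly like the sketch below. It streams the big file with `XMLReader`, expands one "detail" element at a time into a tiny `DOMDocument`, and runs the existing stylesheet on that. Everything named here is hypothetical (`transformInChunks`, `$detailName`, `$wrapperName`), and as a simplification it wraps each detail element in a bare root element instead of copying a real header and footer, so treat it as a starting point rather than a finished implementation.

```php
<?php
// Apply a stylesheet to one "detail" element at a time.
// Returns an array with one transformation result per detail element;
// combining the results is left to the caller.
function transformInChunks($xmlPath, $xslPath, $detailName, $wrapperName)
{
    $xsl = new DOMDocument();
    $xsl->load($xslPath);
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);

    $reader = new XMLReader();
    if (!$reader->open($xmlPath)) {
        throw new RuntimeException("Cannot open $xmlPath");
    }

    // Advance to the first detail element.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === $detailName) {
            $found = true;
            break;
        }
    }

    $results = array();
    while ($found) {
        // Build a small document: wrapper element plus one detail subtree.
        $chunk = new DOMDocument();
        $root = $chunk->appendChild($chunk->createElement($wrapperName));
        $root->appendChild($chunk->importNode($reader->expand(), true));

        $results[] = $proc->transformToXML($chunk);

        // Skip the subtree just processed and look for the next detail element.
        $found = $reader->next($detailName);
    }
    $reader->close();
    return $results;
}
```

The stylesheet sees only one detail element per run, so templates that rely on looking at other detail elements (or on header data that the wrapper does not carry) will not behave the same as on the full document.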

So, here are your options. Choose one.

  • Parse and process XML using SAX.
  • Use SimpleXML and hope that it will allow slightly larger files within the same memory.
  • Execute an external XSLT processor and hope that it will allow slightly larger files within the same memory.
  • Split and merge XML using the method described above and apply XSLT on small chunks only. This method is only practical with some schemas.
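For the external-processor option, a rough sketch could shell out to `xsltproc`, the command-line tool that ships with libxslt. This assumes `xsltproc` is installed on the server and that `exec()` is not disabled by the host, neither of which I can know. The function names are made up for illustration. Keep in mind that `xsltproc` still builds a full tree in memory, so this only helps if the limit you are hitting applies to the PHP process rather than to the machine as a whole.

```php
<?php
// Build the shell command for xsltproc, quoting every path.
// Hypothetical helper; argument order is: stylesheet first, then the XML input.
function buildXsltprocCommand($xslPath, $xmlPath, $outPath)
{
    return 'xsltproc --output ' . escapeshellarg($outPath)
        . ' ' . escapeshellarg($xslPath)
        . ' ' . escapeshellarg($xmlPath);
}

// Run the transformation outside the PHP process.
// Returns true when xsltproc exits with status 0.
function runExternalXslt($xslPath, $xmlPath, $outPath)
{
    $lines = array();
    $status = 1;
    exec(buildXsltprocCommand($xslPath, $xmlPath, $outPath) . ' 2>&1', $lines, $status);
    return $status === 0;
}
```

With the files from the question this would be called as `runExternalXslt("template.xslt", "largeFile.xml", "output.html")`, where the output file name is again just an example, and the script would then stream that file to the browser instead of echoing a string.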
Jirka Hanika
  • I am on PHP Version 5.3.10 both on WAMP and the webserver. The same xml file is parsed correctly using WAMP. – Santosh Pillai Jun 26 '12 at 08:03
  • @SantoshPillai - What's the length of the longest continuous text in your XML file? – Jirka Hanika Jun 26 '12 at 10:29
  • 150 characters is the longest continuous line. – Santosh Pillai Jun 26 '12 at 13:42
  • @SantoshPillai - and each line contains at least one tag? Continuous text means basically text not interrupted by `<` or `>` characters; newlines don't matter. Continuous text may contain `&lt;` or `&gt;` which might display as angle brackets in some text editors. – Jirka Hanika Jun 26 '12 at 14:22
  • yes each line has a tag. Some of continuous text in my xml file looks like this. SubNetwork=Root,SubNetwork=node007,MeContext=node007,ManagedElement=1,vsDataTransportNetwork=1,vsDataUniSaalTp=caecca – Santosh Pillai Jun 26 '12 at 15:03
  • @SantoshPillai - And can you also confirm no `<![CDATA[` tags anywhere in the whole document? Then you are not affected by the known bug and we'll look for some other cause. – Jirka Hanika Jun 26 '12 at 15:21
  • There is no CDATA tag anywhere in the xml doc. I tested with a smaller xml doc (40mb, the large doc is 500mb) of similar type and it works. I am able to see the parsed output for the smaller xml doc – Santosh Pillai Jun 26 '12 at 15:26
  • Many thanks for the detailed explanation on this, you clearly have very good experience with xml parsing. Would you suggest using JavaBridge for PHP for parsing. This method seems to use SAX. I just saw this today at http://php.net/manual/en/book.xsl.php Is this easy to implement? – Santosh Pillai Jun 27 '12 at 08:36
  • @SantoshPillai - PHP/JavaBridge is just a bridge. It has nothing to do with XSLT or with transforming XML at all. If you already know how to solve your problem in Java, then it gives you an additional option to glue PHP and Java together, but it cannot help you transform XML itself. – Jirka Hanika Jun 27 '12 at 09:38
  • I have tried parsing using xalan and it works (this is DOM). I dont know how to use java SAX parser. From the options in your solution, the first option seems good. I would still like to use XSLT as I have lots of work already put in for templates, hence I was thinking in the direction of using java. Since the file is big, I don't know if 2nd and 3rd options would work. The last option looks difficult too since the xml file is complex with a large tree structure with loads of branches and sub-branches – Santosh Pillai Jun 27 '12 at 11:22
  • @SantoshPillai - Thank you for recording your resolution. I removed the obsolete part from the answer and agree with your evaluation of the last option; my own XML use cases have been almost relational (lots of independent records all with the same pretty much flat schema). – Jirka Hanika Jun 27 '12 at 12:54