I have to parse large XML files in PHP. One of them is 6.5 MB, and they could be even bigger. From what I've read, the SimpleXML extension loads the entire file into an object, which may not be very efficient. In your experience, what would be the best way?
Check out [Pull Parsing in PHP](http://www.ibm.com/developerworks/xml/library/x-pullparsingphp/index.html) – Randolpho Jul 22 '09 at 17:58
The article is about XMLReader: http://php.net/manual/en/book.xmlreader.php "Unlike SimpleXML, it's a full XML parser that handles all documents, not just some of them. Unlike DOM, it can handle documents larger than available memory. Unlike SAX, it puts your program in control." – WayFarer Jan 04 '12 at 19:34
I have heard of people having good success with XMLReader: http://php.net/manual/en/book.xmlreader.php – Steven Jul 23 '09 at 01:00
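The XMLReader approach mentioned in these comments could be sketched roughly as follows. This is a minimal illustration, not code from the linked article: the function name `eachElement` and the element name passed to it are hypothetical, and it assumes the records of interest are repeated sibling elements small enough to load one at a time.

```php
<?php
// Sketch of pull parsing with XMLReader (function and element names are hypothetical).
// Only one matching element is expanded into a SimpleXMLElement at any time.
function eachElement(string $path, string $elementName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Advance node by node until the first matching element is reached.
        $more = $reader->read();
        while ($more && $reader->name !== $elementName) {
            $more = $reader->read();
        }
        while ($more) {
            if ($reader->nodeType === XMLReader::ELEMENT) {
                // Materialise just this element's subtree.
                yield new SimpleXMLElement($reader->readOuterXml());
            }
            // Skip the subtree and move to the next node with the same name.
            $more = $reader->next($elementName);
        }
    } finally {
        $reader->close();
    }
}
```

A caller would iterate with `foreach (eachElement('myLargeXmlFile.xml', 'item') as $item) { ... }`, so memory use depends on the size of a single record rather than on the size of the whole file.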
7 Answers
For a large file, you'll want to use a SAX parser rather than a DOM parser.
A DOM parser reads in the whole file and loads it into an object tree in memory. A SAX parser reads the file sequentially and calls your user-defined callback functions to handle the data (start tags, end tags, character data, etc.).
With a SAX parser you'll need to maintain state yourself (e.g. which tag you are currently in), which makes it a bit more complicated, but for a large file it will be much more efficient memory-wise.
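As a rough illustration of the callbacks and the hand-maintained state described above, here is a minimal sketch using PHP's built-in `xml_*` functions. The function name `collectTitles` and the `title` element are assumptions made for the example, not part of the original answer.

```php
<?php
// Minimal SAX-style sketch using PHP's built-in expat wrapper.
// The function name and the "title" element are hypothetical.
function collectTitles(string $path): array
{
    $titles  = [];
    $inTitle = false; // state we have to track ourselves
    $buffer  = '';

    $parser = xml_parser_create();
    // Keep element names as written (the default folds them to upper case).
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Start-tag callback: note when we enter a <title>.
        function ($parser, string $name, array $attrs) use (&$inTitle, &$buffer) {
            if ($name === 'title') {
                $inTitle = true;
                $buffer  = '';
            }
        },
        // End-tag callback: store the text collected for this <title>.
        function ($parser, string $name) use (&$inTitle, &$buffer, &$titles) {
            if ($name === 'title') {
                $titles[] = trim($buffer);
                $inTitle  = false;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        // Character data may arrive in several pieces, so append.
        function ($parser, string $data) use (&$inTitle, &$buffer) {
            if ($inTitle) {
                $buffer .= $data;
            }
        }
    );

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // Feed the file to the parser in small chunks instead of loading it whole.
        while (!feof($handle)) {
            $chunk = (string) fread($handle, 8192);
            if (xml_parse($parser, $chunk, feof($handle)) === 0) {
                throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
            }
        }
    } finally {
        fclose($handle);
    }
    return $titles;
}
```

The file is read 8 KB at a time, so only the current chunk and whatever the callbacks choose to keep are held in memory.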

My take on it:
https://github.com/prewk/XmlStreamer
A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.
    class SimpleXmlStreamer extends XmlStreamer {
        public function processNode($xmlString, $elementName, $nodeIndex) {
            $xml = simplexml_load_string($xmlString);
            // Do something with your SimpleXML object
            return true;
        }
    }

    $streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
    $streamer->parse();

oskarth: I don't understand how to use this class. Could you enlighten me a little? Or could you post the complete code? – www.amitpatil.me Dec 18 '12 at 19:21
I was previously using `XMLReader`, but it crashes if the document is not well formed. This class solves the problem and is much faster. – Drahcir Dec 02 '13 at 14:36
I'm glad it helped you! @www.amitpatil.me: Sorry for being a year too late with this answer but.. there's a readme on github now :) – oskarth Dec 22 '13 at 21:56
When using a `DOMDocument` with large XML files, don't forget to pass the `LIBXML_PARSEHUGE` flag in the options of the `load()` method. (The same applies to the other load methods of the `DOMDocument` object.)

    $checkDom = new \DOMDocument('1.0', 'UTF-8');
    $checkDom->load($filePath, LIBXML_PARSEHUGE);

(Works with a 120 MB XML file.)

A SAX parser, as Eric Petroelje recommends, would be better for large XML files. A DOM parser loads the entire XML file into memory and lets you run XPath queries against it; a SAX (Simple API for XML) parser reads the document sequentially and gives you hook points (callbacks) for processing each piece as it is encountered.

Object Oriented example: http://php-and-symfony.matthiasnoback.nl/2012/04/php-create-an-object-oriented-xml-parser-using-the-built-in-xml_-functions/ – Reza S Nov 27 '13 at 22:24
It really depends on what you want to do with the data. Do you need it all in memory to work with it effectively?
6.5 MB is not that big by the standards of today's computers. You could, for example, raise the limit with `ini_set('memory_limit', '128M');`.
However, if your data can be streamed, you may want to look at using a SAX parser. It really depends on your usage needs.

Even though the file itself is 6.5 MB, it is much bigger after parsing. I had a 20 MB XML file, and when calling `xml_parse_into_struct` I needed to set memory_limit to 512 MB, or else it would fail. – faulty Mar 13 '11 at 17:17
A SAX parser is the way to go, although I've found that SAX parsing can get messy if you don't stay organised.
I use an approach based on STX (Streaming Transformations for XML) to parse large XML files. I use the SAX methods to build a SimpleXML object that keeps track of the data in the current context (i.e. just the nodes between the root and the current node). Other functions are then used for processing the SimpleXML document.

I needed to parse a large XML file that happened to have an element on each line (the StackOverflow data dump). In this specific case it was sufficient to read the file one line at a time and parse each line using SimpleXML. For me this had the advantage of not having to learn anything new.
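The line-by-line approach described above could be sketched as follows. This only works when every record really does sit on its own line. The function name `eachRow` and the `<row ...>` element prefix are assumptions made for the illustration and should be adapted to the actual file.

```php
<?php
// Sketch: read one line at a time and parse each record line with SimpleXML.
// Assumes one self-contained <row .../> element per line (hypothetical layout).
function eachRow(string $path): Generator
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            // Skip the XML declaration and the wrapper's opening/closing tags.
            if (strpos($line, '<row ') !== 0) {
                continue;
            }
            $row = simplexml_load_string($line);
            if ($row !== false) {
                yield $row;
            }
        }
    } finally {
        fclose($handle);
    }
}
```

Each yielded object is an ordinary `SimpleXMLElement`, so attributes can be read as usual, for example `(string) $row['Id']`.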
