3

I have a 5 MB XML file.

I'm using the following code to read the nodeValue of every node:

$dom = new DomDocument('1.0', 'UTF-8');
if(!$dom->load($url))
    return;

$games = $dom->getElementsByTagName("game");
foreach($games as $game)
{
            
}

This takes 76 seconds, and there are around 2000 game tags. Is there any optimization, or another way to get the data?

Syscall
Mokus
  • I can't imagine optimizing a loop without knowing what the loop does. – Herbert Sep 04 '11 at 15:15
  • Look at this link: http://stackoverflow.com/questions/188414/best-xml-parser-for-php – steve Sep 04 '11 at 15:15
  • @steve: maybe you can elaborate and put that in the form of an answer. How can SimpleXML speed up the loop to get at the data? – Herbert Sep 04 '11 at 15:18
  • You can find some useful suggestions at this link: http://stackoverflow.com/questions/188414/best-xml-parser-for-php – monish Sep 04 '11 at 15:19
  • SimpleXML (as the others are suggesting) may speed up retrieval, but that 2000 iteration loop is where your performance problems are coming from. It would help to know what you want to do with the data. – Herbert Sep 04 '11 at 15:26

3 Answers

2

I once wrote a blog article about loading huge XML files with XMLReader - you can probably reuse some of it.

Using DOM or SimpleXML is not an option, since both load the whole document into memory.
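
As a rough illustration of the streaming idea (not taken from the blog article), here is a minimal sketch that walks the document with XMLReader and only materialises one game element at a time. The function name readGameTitles and the game/title layout are assumptions made for the example; the question does not show the real XML.

```php
<?php
// Sketch: stream <game> elements one at a time instead of building the whole tree.
// readGameTitles() and the <game>/<title> layout are hypothetical.

function readGameTitles(string $xml): array
{
    $reader = new XMLReader();
    // XML() parses a string; for a file or URL, open() would be used instead.
    $reader->XML($xml);

    $titles = [];

    // Advance until the reader sits on the first <game> element.
    while ($reader->read() && $reader->name !== 'game') {
        // keep reading
    }

    // Hop from one <game> to the next <game>, skipping each subtree.
    while ($reader->name === 'game') {
        // Only this single element is turned into an object.
        $game = simplexml_load_string($reader->readOuterXml());
        $titles[] = (string) $game->title;
        $reader->next('game');
    }

    $reader->close();
    return $titles;
}
```

Calling readGameTitles() with the contents of a small test file should return a plain array of title strings. The sketch assumes all game elements are siblings, since next() moves to the following element and never descends into the current one.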

cweiske
  • SimpleXml is quite good, I tested on an xml file, the DOM took around 30sec and the SimpleXML took 1sec:) – Mokus Sep 05 '11 at 11:01
  • SimpleXML has proved perfectly useful for OP and DOM is too slow - exactly as I suggested. XMLReader is the fastest along with SAX. – Alex Sep 05 '11 at 13:23
1

You can use DOMXPath for querying, which is much faster than the DOMDocument::getElementsByTagName() method.

<?php
$xpath = new \DOMXpath($dom);
$games = $xpath->query("//game");

foreach ($games as $game) {
    // Code here
}

In one of my tests with a fairly large file, this approach took less than 1 second to iterate over 24k elements, whereas DOMDocument::getElementsByTagName() took about 27 minutes (and the time taken to move to the next object grew exponentially).
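
If you want to check the difference on your own setup, a small timing harness along these lines may help. It is only a sketch: the synthetic document, the element count of 2000, and the helper names buildSampleDocument and timeIteration are all made up for illustration, and the timings will depend on the file and on the PHP and libxml versions.

```php
<?php
// Sketch: time both lookups on synthetic data (document shape is hypothetical).

function buildSampleDocument(int $count): DOMDocument
{
    $xml = '<games>';
    for ($i = 0; $i < $count; $i++) {
        $xml .= "<game><title>Game $i</title></game>";
    }
    $xml .= '</games>';

    $dom = new DOMDocument();
    $dom->loadXML($xml);
    return $dom;
}

// Iterates the node list once and returns [nodes seen, elapsed seconds].
function timeIteration(iterable $nodes): array
{
    $start = microtime(true);
    $seen = 0;
    foreach ($nodes as $node) {
        $seen++;
    }
    return [$seen, microtime(true) - $start];
}

$dom = buildSampleDocument(2000);

[$countByTag, $secondsByTag] = timeIteration($dom->getElementsByTagName('game'));
[$countByXpath, $secondsByXpath] = timeIteration((new DOMXPath($dom))->query('//game'));

printf("getElementsByTagName: %d nodes in %.4f s\n", $countByTag, $secondsByTag);
printf("DOMXPath::query:      %d nodes in %.4f s\n", $countByXpath, $secondsByXpath);
```

Both calls should report the same number of nodes; only the elapsed time is expected to differ, and by how much is something to measure rather than assume.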

paul.ago
1

You shouldn't use the Document Object Model on large XML files. It is intended for human-readable documents, not big datasets!

If you want fast access you should use XMLReader or SimpleXML.

XMLReader is ideal for parsing whole documents, and SimpleXML has a nice XPath function for retrieving data quickly.

For XMLReader you can use the following code:

<?php

// Parse a large document with XMLReader, expanding each <game> into DOM/DOMXPath
$reader = new XMLReader();

if (!$reader->open("tooBig.xml")) {
    die("Could not open tooBig.xml");
}

while ($reader->read()) {
    // Only element nodes named <game> are of interest
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === "game") {
        // Copy just this element into its own small DOM document
        $dom = new DOMDocument();
        $dom->appendChild($dom->importNode($reader->expand(), true));

        $xp = new DOMXPath($dom);
        $res = $xp->query("/game/title"); // this is an example
        if ($res->length > 0) {
            echo $res->item(0)->nodeValue, PHP_EOL;
        }
    }
}

$reader->close();
?>

The above will output all game titles (assuming each game element has a title child, i.e. a /game/title structure).

For SimpleXML you can use:

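$xml = file_get_contents($url);
$sxml = new SimpleXMLElement($xml); // the class is SimpleXMLElement, not SimpleXML
// '//game' matches <game> at any depth; '/game' only matches a root element named <game>
$games = $sxml->xpath('//game'); // returns an array of SimpleXMLElement nodes
foreach ($games as $game)
{
   print (string) $game; // cast to string to get the element's text
}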
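
To get a plain string out of a SimpleXMLElement, casting is usually enough. Here is a minimal self-contained sketch; the sport child element and the inline sample XML are hypothetical, since the real document is not shown in the question.

```php
<?php
// Sketch: a SimpleXMLElement casts to its text content.
// The <sport> child element is a made-up example.

$xml = '<games><game><sport>Handball</sport></game><game><sport>Chess</sport></game></games>';
$sxml = simplexml_load_string($xml);

$sports = [];
foreach ($sxml->xpath('//game') as $game) {
    // Without the cast, $game->sport is still a SimpleXMLElement object.
    $sports[] = (string) $game->sport;
}

print implode(', ', $sports) . "\n";
```

With the sample XML above this prints "Handball, Chess". The same cast works on any element returned by xpath().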
Alex
  • Thanks for your help. I have two questions: what is the slash before game, and how can I get the string out of this element: object(SimpleXMLElement)[8991] string 'Handball' (length=8)? I want the 'Handball'. – Mokus Sep 04 '11 at 17:30
  • No probs... The slash in `/game` shows the root of the document. This is how XPath works (Google XPath for more info). In order to answer your second question I would need to see an example of the XML you are using. If you edit your question and paste it in, I can see it. – Alex Sep 04 '11 at 21:16
  • SimpleXML also loads the whole file, which brings absolutely no speed improvements. DOM itself has XPath support, too. – cweiske Sep 05 '11 at 09:15
  • @cweiske - Did you not notice I suggested XMLReader first? This is faster. To OP: Please read through these pages for everything you need to know about PHP and XML http://www.ibm.com/developerworks/xml/library/x-xmlphp1/index.html – Alex Sep 05 '11 at 13:22