
I have the following XML-structure in my XML file (it's not the whole XML-file, only a part of it):

<?xml version="1.0" encoding="utf-8"?>
    <extensions>
        <extension extensionkey="fp_product_features">
            <downloadcounter>355</downloadcounter>
            <version version="0.1.0">
                <title>Product features</title>
                <description/>
                <downloadcounter>24</downloadcounter>
                <state>beta</state>
                <reviewstate>0</reviewstate>
                <category>plugin</category>
                <lastuploaddate>1142878270</lastuploaddate>
                <uploadcomment> added related features</uploadcomment>
            </version>
        </extension>
    </extensions>

The file is too big for SimpleXML, so I'm using XMLReader. I have a switch that checks for the XML-tags and their content:

while ($xmlReader->read()) {

                if ($xmlReader->nodeType == XMLReader::ELEMENT) {

                    switch ($xmlReader->name) {

                        case "title" :

                            $xmlReader->read();
                            $foo = $xmlReader->value;
                            //Do stuff with the value

                            break;

                        case  "description":

                            $xmlReader->read();
                            $bar = $xmlReader->value;
                           //Do stuff with the value

                            break;

                        case "downloadcounter" :

                            $xmlReader->read();
                            $foobar = $xmlReader->value;
                           //Do stuff with the value

                            break;

                        case "state" :

                            $xmlReader->read();
                            $barfoo = $xmlReader->value;
                            //Do stuff with the value

                        break;


                     //Repeat for other tags

                    }
                }
            }

The problem here is that there are two <downloadcounter> tags. The one beneath <extension> and the one beneath <version>. I need the one beneath <version>, but the code in my switch is giving me the one beneath <extension>. All the other cases are giving me the right information.

I have thought about some solutions. Maybe there is a way to tell XMLReader to read only the tag after <description>? I have tried calling $xmlReader->read() multiple times in one case, but that didn't help. I'm very new to this, so maybe this is not the right way to do it, but if anyone can point me in the right direction, it would be much appreciated.

Thanks in advance!

Kevin Kromjong
  • Have you tried using DOMDocument? You can then use an XPath to get the appropriate nodes. Alternatively, can you set a counter that will only record the second `<downloadcounter>` tag that you come across? – i alarmed alien Oct 20 '14 at 19:41

1 Answer


Ok, some notes on this...

The file is too big for SimpleXML, so I'm using XMLReader.

That would mean that loading the XML file with SimpleXML reaches PHP's memory_limit, right? Alternatives would be to stream or chunk read the XML file and process the parts.

$xml_chunk = (.... read file chunked ...)
$xml = simplexml_load_string($xml_chunk);
$json = json_encode($xml);
$array = json_decode($json,TRUE);
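
One way to do the chunking hinted at above is to let XMLReader walk the file and hand only one <extension> subtree at a time to SimpleXML. The following is a minimal sketch, not a definitive implementation: extensionCounters() is a hypothetical helper name, and the XML is a trimmed copy of the sample from the question (a real file would be opened with $reader->open() instead).

```php
<?php
// Hypothetical helper: collects the per-version download counters,
// keyed by extension key and version number.
function extensionCounters(XMLReader $reader): array
{
    $counters = [];

    // Advance to the first <extension> element.
    while ($reader->read() && $reader->name !== 'extension') {
        // keep reading
    }

    while ($reader->name === 'extension') {
        // Only this one subtree is loaded into memory by SimpleXML.
        $extension = simplexml_load_string($reader->readOuterXml());
        $key = (string) $extension['extensionkey'];

        foreach ($extension->version as $version) {
            $counters[$key][(string) $version['version']] = (int) $version->downloadcounter;
        }

        // Skip the subtree just handled and move to the next <extension>.
        $reader->next('extension');
    }

    return $counters;
}

// Trimmed copy of the sample XML from the question.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<extensions>
    <extension extensionkey="fp_product_features">
        <downloadcounter>355</downloadcounter>
        <version version="0.1.0">
            <title>Product features</title>
            <description/>
            <downloadcounter>24</downloadcounter>
            <state>beta</state>
        </version>
    </extension>
</extensions>
XML;

$reader = new XMLReader();
$reader->XML($xml);
$counters = extensionCounters($reader);
$reader->close();

print_r($counters); // ['fp_product_features' => ['0.1.0' => 24]]
```

Because $extension->version->downloadcounter is addressed through its parent, the <downloadcounter> directly beneath <extension> is never picked up by accident.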

But working with XMLReader is fine!

Maybe there is a way where I can specify that XMLReader only reads the tag after <description>?

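Yes, there is. As "i alarmed alien" pointed out: if you work with DOMDocument, you can use an XPath query to reach exactly the node you want. Keep in mind that DOMDocument, like SimpleXML, loads the whole document into memory, so this only helps if the file fits within memory_limit.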

$dom = new DomDocument();
$dom->load("tooBig.xml");
$xp = new DomXPath($dom);

$result = $xp->query("/extensions/extension/version/downloadcounter");

print $result->item(0)->nodeValue ."\n";

For more examples see the PHP manual: http://php.net/manual/de/domxpath.query.php
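
To collect the counter for every extension rather than only the first match, the query can be run relative to each <extension> node. This is a small sketch under the same assumptions as above (the file fits in memory); the XML is a trimmed copy of the sample from the question and $perExtension is just an illustrative variable name.

```php
<?php
// Trimmed copy of the sample XML from the question.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<extensions>
    <extension extensionkey="fp_product_features">
        <downloadcounter>355</downloadcounter>
        <version version="0.1.0">
            <title>Product features</title>
            <downloadcounter>24</downloadcounter>
        </version>
    </extension>
</extensions>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

$perExtension = [];
foreach ($xp->query('/extensions/extension') as $extension) {
    $key = $extension->getAttribute('extensionkey');

    // "version/downloadcounter" is evaluated relative to $extension, so the
    // <downloadcounter> directly beneath <extension> cannot match.
    foreach ($xp->query('version/downloadcounter', $extension) as $counter) {
        $perExtension[$key][] = (int) $counter->nodeValue;
    }
}

print_r($perExtension); // ['fp_product_features' => [24]]
```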


If you want to stick to XMLReader:

The XMLReader extension is an XML pull parser. The reader moves forward through the document stream, stopping at each node on the way. This explains why you get the first <downloadcounter> from beneath the <extension> tag, but not the one beneath <version>. It also makes iteration hard, because lookahead is not really possible without re-reading.

DEMO http://ideone.com/Oykfyh

<?php

$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
    <extensions>
        <extension extensionkey="fp_product_features">
            <downloadcounter>355</downloadcounter>
            <version version="0.1.0">
                <title>Product features</title>
                <description/>
                <downloadcounter>24</downloadcounter>
                <state>beta</state>
                <reviewstate>0</reviewstate>
                <category>plugin</category>
                <lastuploaddate>1142878270</lastuploaddate>
                <uploadcomment> added related features</uploadcomment>
            </version>
        </extension>
    </extensions>
XML;

$reader = new XMLReader();
// data://text/plain,<payload> is the documented form of PHP's data: stream wrapper
$reader->open('data://text/plain,'.urlencode($xml));

$result = [];
$element = null;

while ($reader->read()) {

  if($reader->nodeType === XMLReader::ELEMENT) 
  {
    $element = $reader->name;

    if($element === 'extensions') {
        $result['extensions'] = array();
    }

    if($element === 'extension') {
        $result['extensions']['extension'] = array();
    }

    if($element === 'downloadcounter') {
        // isset() avoids an "undefined index" warning before <version> has been seen
        if(!isset($result['extensions']['extension']['version'])) {
            $result['extensions']['extension']['downloadcounter'] = '';
        } /*else {
            $result['extensions']['extension']['version']['downloadcounter'] = '';
        }*/
    }

    if($element === 'version') {
        $result['extensions']['extension']['version'] = array();
        while ($reader->read()) {
           if($reader->nodeType === XMLReader::ELEMENT) 
           {
               $element = $reader->name;
               $result['extensions']['extension']['version'][$element] = '';
           }
           if($reader->nodeType === XMLReader::TEXT) 
           {
               $value = $reader->value;
               $result['extensions']['extension']['version'][$element] = $value;
           }
        }
    }
  }

  if($reader->nodeType === XMLReader::TEXT) 
  {
    $value = $reader->value;

    if($element === 'downloadcounter') {
        // isset() avoids an "undefined index" warning before <version> has been seen
        if(!isset($result['extensions']['extension']['version'])) {
            $result['extensions']['extension']['downloadcounter'] = $value;
        } else {
            $result['extensions']['extension']['version']['downloadcounter'] = $value;
        }
    }
  }
}
$reader->close();

echo var_export($result, true);

Result:

array (
  'extensions' => 
  array (
    'extension' => 
    array (
      'downloadcounter' => '355',
      'version' => 
      array (
        'title' => 'Product features',
        'description' => '',
        'downloadcounter' => '24',
        'state' => 'beta',
        'reviewstate' => '0',
        'category' => 'plugin',
        'lastuploaddate' => '1142878270',
        'uploadcomment' => ' added related features',
      ),
    ),
  ),
)

This transforms your XML into an array (with nested arrays). It's not really perfect, because of unnecessary iterations. Feel free to hack away...
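
If only the <downloadcounter> beneath <version> is needed, one way to avoid the nested loop is to remember which elements are currently open and check the parent before taking a value. The following is a minimal sketch of that idea, not a drop-in replacement for the code above: versionDownloadCounters() is a hypothetical helper name and the XML is a trimmed copy of the sample from the question.

```php
<?php
// Hypothetical helper: returns every <downloadcounter> value whose
// parent element is <version>.
function versionDownloadCounters(XMLReader $reader): array
{
    $path = [];      // names of the currently open elements, outermost first
    $counters = [];

    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT) {
            $parent = end($path); // false while no element is open yet

            if ($reader->name === 'downloadcounter' && $parent === 'version') {
                // readString() returns the text content without moving the cursor.
                $counters[] = (int) $reader->readString();
            }

            // Empty elements such as <description/> never produce an
            // END_ELEMENT node, so they must not be pushed.
            if (!$reader->isEmptyElement) {
                $path[] = $reader->name;
            }
        } elseif ($reader->nodeType === XMLReader::END_ELEMENT) {
            array_pop($path);
        }
    }

    return $counters;
}

// Trimmed copy of the sample XML from the question.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<extensions>
    <extension extensionkey="fp_product_features">
        <downloadcounter>355</downloadcounter>
        <version version="0.1.0">
            <title>Product features</title>
            <description/>
            <downloadcounter>24</downloadcounter>
            <state>beta</state>
        </version>
    </extension>
</extensions>
XML;

$reader = new XMLReader();
$reader->XML($xml);
$versionCounters = versionDownloadCounters($reader);
$reader->close();

print_r($versionCounters); // [24]
```

The reader still makes a single forward pass, and memory use stays small because only the stack of open element names is kept.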

Additionally, see "Parsing Huge XML Files in PHP" and https://github.com/prewk/XmlStreamer

Jens A. Koch
  • Thanks for the answer! I have one question: how do I need to modify this – Kevin Kromjong Oct 21 '14 at 08:28
  • the question is hard to answer. i don't know. it depends, on what you want to achieve. if you want the data value for a specific tag from the array: `echo $result['extensions']['extension']['version']['downloadcounter'];` – Jens A. Koch Oct 21 '14 at 09:56
  • Nevermind. I wanted to ask a more specific question, but the answer popped up in my head while I was typing. I think I accidentally hit the enter button and didn't notice that I posted an unfinished comment. Your solution is great and works well, thanks! – Kevin Kromjong Oct 21 '14 at 10:01
  • Do you know if XMLReader processes the data faster than DomDocument, or is DomDocument the fastest of the two? – Kevin Kromjong Oct 22 '14 at 11:12
  • That's an interesting question! I assume the following order (fastest first): XMLReader, SAX, SimpleXML, DOM. XMLReader has a better memory footprint than DOMDocument, especially on large XML files. SimpleXML and DOM work on the whole dataset in memory. – Jens A. Koch Oct 22 '14 at 12:11
  • I'm not aware of any comparison report (including speed) for these PHP implementations. Most of them are wrapper extensions using C/C++ libraries, for instance XMLReader wraps libxml. "XML Parser" wraps libxml and Expat. Sadly, PHP doesn't have extensions for asmXML, RapidXML or vtd-xml, which are the fastest XML parsers around. You will find benchmarks more on the C/C++ lib side, for instance here: http://pugixml.org/benchmark/ – Jens A. Koch Oct 22 '14 at 12:30