87

I have the following XML file, the file is rather large and i haven't been able to get simplexml to open and read the file so i'm trying XMLReader with no success in php

<?xml version="1.0" encoding="ISO-8859-1"?>
<products>
    <last_updated>2009-11-30 13:52:40</last_updated>
    <product>
        <element_1>foo</element_1>
        <element_2>foo</element_2>
        <element_3>foo</element_3>
        <element_4>foo</element_4>
    </product>
    <product>
        <element_1>bar</element_1>
        <element_2>bar</element_2>
        <element_3>bar</element_3>
        <element_4>bar</element_4>
    </product>
</products>

I've unfortunately not found a good tutorial on this for PHP and would love to see how I can get each element content to store in a database.

BenMorel
  • 34,448
  • 50
  • 182
  • 322
Shadi Almosri
  • 11,678
  • 16
  • 58
  • 80
  • 1
    Have you read some of the user contributed examples in the PHP documentation? http://www.php.net/manual/en/class.xmlreader.php#61929 may help. – mcrumley Dec 02 '09 at 19:48

7 Answers7

235

It all depends on how big the unit of work, but I guess you're trying to treat each <product/> nodes in succession.

For that, the simplest way would be to use XMLReader to get to each node, then use SimpleXML to access them. This way, you keep the memory usage low because you're treating one node at a time and you still leverage SimpleXML's ease of use. For instance:

$z = new XMLReader;
$z->open('data.xml');

$doc = new DOMDocument;

// move to the first <product /> node
while ($z->read() && $z->name !== 'product');

// now that we're at the right depth, hop to the next <product/> until the end of the tree
while ($z->name === 'product')
{
    // either one should work
    //$node = new SimpleXMLElement($z->readOuterXML());
    $node = simplexml_import_dom($doc->importNode($z->expand(), true));

    // now you can use $node without going insane about parsing
    var_dump($node->element_1);

    // go to next <product />
    $z->next('product');
}

Quick overview of pros and cons of different approaches:

XMLReader only

  • Pros: fast, uses little memory

  • Cons: excessively hard to write and debug, requires lots of userland code to do anything useful. Userland code is slow and prone to error. Plus, it leaves you with more lines of code to maintain

XMLReader + SimpleXML

  • Pros: doesn't use much memory (only the memory needed to process one node) and SimpleXML is, as the name implies, really easy to use.

  • Cons: creating a SimpleXMLElement object for each node is not very fast. You really have to benchmark it to understand whether it's a problem for you. Even a modest machine would be able to process a thousand nodes per second, though.

XMLReader + DOM

  • Pros: uses about as much memory as SimpleXML, and XMLReader::expand() is faster than creating a new SimpleXMLElement. I wish it was possible to use simplexml_import_dom() but it doesn't seem to work in that case

  • Cons: DOM is annoying to work with. It's halfway between XMLReader and SimpleXML. Not as complicated and awkward as XMLReader, but light years away from working with SimpleXML.

My advice: write a prototype with SimpleXML, see if it works for you. If performance is paramount, try DOM. Stay as far away from XMLReader as possible. Remember that the more code you write, the higher the possibility of you introducing bugs or introducing performance regressions.

Josh Davis
  • 28,400
  • 5
  • 52
  • 67
  • 1
    is there a way to do this purely with XMLReader or is there no advantage? – Shadi Almosri Dec 03 '09 at 02:12
  • 2
    You could do it entirely with XMLReader. The advantage is that it would be faster and use marginally less memory. The disadvantage is it would take considerably longer to write and be very much harder to debug. – Josh Davis Dec 03 '09 at 05:36
  • 2
    Why didn't you just use $z->next('product') when moving to the first product node? – redolent Sep 19 '12 at 23:01
  • I don't remember that specific code, sorry. If I didn't add any note about it, it might just be that I overlooked the possibility. – Josh Davis Sep 20 '12 at 09:03
  • Would it be better performance to use simple_xml_loadstring($z->readInnerXml()); then creating a dom object just to parse and pass it to simple xml? For the XMLReader + SimpleXML option – thefreeman Oct 25 '12 at 14:01
  • 2
    Most of the XMLReader based parsing can be expressed / wrapped into the iterator pattern. I've compiled some useful iterators and filters for that: http://git.io/xmlreaderiterator ([gist](https://gist.github.com/hakre/5147685)) – hakre Apr 05 '13 at 23:03
  • One more point is that you forgot to mention, and that is that XSD validation is not possible by using only `SimpleXmlElement` – Oleg Belousov Sep 11 '15 at 15:36
  • Just did a quick benchmark using almost the same code with a document with 36824 nodes. Using `SimpleXMLElement` took 1.340s, `simplexml_import_dom` with `importNode` took 0.952s. Quite a difference. – Oytun Dec 05 '15 at 01:05
  • `$z->next()` somehow didn't work with me even though the document had sibling nodes. Just removed that line and let the `while` flow on its own, then it worked. – Oytun Dec 05 '15 at 01:07
  • And yet another quick benchmark: if you don't need a `SimpleXMLElement` but a `DOMNode` works for you, you can eliminate `simplexml_import_dom` call and go down to 0.841s :) – Oytun Dec 05 '15 at 01:10
  • SinpleXML: If **product** node have child, how to read it? I try **foreach($product->properties as $property)** but i receive **class SimpleXMLElement** in **$property** – Adobe Oct 17 '16 at 05:50
  • I ended up with [link]https://github.com/prewk/xml-string-streamer because I used a half day trying to make XMLReader continue parsing without stopping on some character it didnt recognize, but without luck. – Raf A. Nov 14 '17 at 13:14
  • Using `$z->next();` instead of `$z->next('product');` will move to the next child anyway but will make the skip much faster (twice faster in my tests). This is probably because we skip the unnecessary string check within the _next_ method. – J. Wrong Apr 11 '19 at 15:27
14

For xml formatted with attributes...

data.xml:

<building_data>
<building address="some address" lat="28.902914" lng="-71.007235" />
<building address="some address" lat="48.892342" lng="-75.0423423" />
<building address="some address" lat="58.929753" lng="-79.1236987" />
</building_data>

php code:

$reader = new XMLReader();

if (!$reader->open("data.xml")) {
    die("Failed to open 'data.xml'");
}

while($reader->read()) {
  if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'building') {
    $address = $reader->getAttribute('address');
    $latitude = $reader->getAttribute('lat');
    $longitude = $reader->getAttribute('lng');
}

$reader->close();
try5tan3
  • 213
  • 3
  • 9
  • 5
    Even though the code is a lot more verbose and manual way to walk through XML, this will save your sanity, since DOMDocument and SimpleXML tend to keep you guessing at what will be returned. – b01 May 03 '16 at 00:38
8

Most of my XML parsing life is spent extracting nuggets of useful information out of truckloads of XML (Amazon MWS). As such, my answer assumes you want only specific information and you know where it is located.

I find the easiest way to use XMLReader is to know which tags I want the information out of and use them. If you know the structure of the XML and it has lots of unique tags, I find that using the first case is the easy. Cases 2 and 3 are just to show you how it can be done for more complex tags. This is extremely fast; I have a discussion of speed over on What is the fastest XML parser in PHP?

The most important thing to remember when doing tag-based parsing like this is to use if ($myXML->nodeType == XMLReader::ELEMENT) {... - which checks to be sure we're only dealing with opening nodes and not whitespace or closing nodes or whatever.

function parseMyXML ($xml) { //pass in an XML string
    $myXML = new XMLReader();
    $myXML->xml($xml);

    while ($myXML->read()) { //start reading.
        if ($myXML->nodeType == XMLReader::ELEMENT) { //only opening tags.
            $tag = $myXML->name; //make $tag contain the name of the tag
            switch ($tag) {
                case 'Tag1': //this tag contains no child elements, only the content we need. And it's unique.
                    $variable = $myXML->readInnerXML(); //now variable contains the contents of tag1
                    break;

                case 'Tag2': //this tag contains child elements, of which we only want one.
                    while($myXML->read()) { //so we tell it to keep reading
                        if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Amount') { // and when it finds the amount tag...
                            $variable2 = $myXML->readInnerXML(); //...put it in $variable2. 
                            break;
                        }
                    }
                    break;

                case 'Tag3': //tag3 also has children, which are not unique, but we need two of the children this time.
                    while($myXML->read()) {
                        if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Amount') {
                            $variable3 = $myXML->readInnerXML();
                            break;
                        } else if ($myXML->nodeType == XMLReader::ELEMENT && $myXML->name === 'Currency') {
                            $variable4 = $myXML->readInnerXML();
                            break;
                        }
                    }
                    break;

            }
        }
    }
$myXML->close();
}
Community
  • 1
  • 1
Josiah
  • 3,008
  • 1
  • 34
  • 45
7

The accepted answer gave me a good start, but brought in more classes and more processing than I would have liked; so this is my interpretation:

$xml_reader = new XMLReader;
$xml_reader->open($feed_url);

// move the pointer to the first product
while ($xml_reader->read() && $xml_reader->name != 'product');

// loop through the products
while ($xml_reader->name == 'product')
{
    // load the current xml element into simplexml and we’re off and running!
    $xml = simplexml_load_string($xml_reader->readOuterXML());

    // now you can use your simpleXML object ($xml).
    echo $xml->element_1;

    // move the pointer to the next product
    $xml_reader->next('product');
}

// don’t forget to close the file
$xml_reader->close();
Francis Lewis
  • 8,872
  • 9
  • 55
  • 65
2

XMLReader is well documented on PHP site. This is a XML Pull Parser, which means it's used to iterate through nodes (or DOM Nodes) of given XML document. For example, you could go through the entire document you gave like this:

<?php
$reader = new XMLReader();
if (!$reader->open("data.xml"))
{
    die("Failed to open 'data.xml'");
}
while($reader->read())
{
    $node = $reader->expand();
    // process $node...
}
$reader->close();
?>

It is then up to you to decide how to deal with the node returned by XMLReader::expand().

sk8terboi87 ツ
  • 3,396
  • 3
  • 34
  • 45
Percutio
  • 979
  • 7
  • 8
  • how will you get it to move to the next node after it has finished processing one? – Shadi Almosri Dec 03 '09 at 02:09
  • 28
    Also regarding XMLReader being well documented on php.net i would disagree, it's one of the worst documented functions i've seen and i've used php.net for a long time and it was the first place i headed to to resolve this before asking here :) – Shadi Almosri Dec 03 '09 at 02:10
  • I'm not sure you understand the way XMLReader::read() goes from one node to another. The XMLReader class also uses libxml, a well known library that is also available for PHP if you want to take a look at it. – Percutio Dec 03 '09 at 04:06
  • 14
    The idea that XMLReader is well documented is nonsense. The problem is that if you don't know where to start, it doesn't tell you anywhere: giving a laundry list of class methods is useless if you don't have the first idea of which ones to call. – xgretsch Sep 28 '11 at 10:06
2
Simple example:

public function productsAction()
{
    $saveFileName = 'ceneo.xml';
    $filename = $this->path . $saveFileName;
    if(file_exists($filename)) {

    $reader = new XMLReader();
    $reader->open($filename);

    $countElements = 0;

    while($reader->read()) {
        if($reader->nodeType == XMLReader::ELEMENT) {
            $nodeName = $reader->name;
        }

        if($reader->nodeType == XMLReader::TEXT && !empty($nodeName)) {
            switch ($nodeName) {
                case 'id':
                    var_dump($reader->value);
                    break;
            }
        }

        if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'offer') {
            $countElements++;
        }
    }
    $reader->close();
    exit(print('<pre>') . var_dump($countElements));
    }
}
sebob
  • 515
  • 6
  • 14
0

This Works Better and Faster For Me


<html>
<head>
<script>
function showRSS(str) {
  if (str.length==0) {
    document.getElementById("rssOutput").innerHTML="";
    return;
  }
  if (window.XMLHttpRequest) {
    // code for IE7+, Firefox, Chrome, Opera, Safari
    xmlhttp=new XMLHttpRequest();
  } else {  // code for IE6, IE5
    xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
  }
  xmlhttp.onreadystatechange=function() {
    if (this.readyState==4 && this.status==200) {
      document.getElementById("rssOutput").innerHTML=this.responseText;
    }
  }
  xmlhttp.open("GET","getrss.php?q="+str,true);
  xmlhttp.send();
}
</script>
</head>
<body>

<form>
<select onchange="showRSS(this.value)">
<option value="">Select an RSS-feed:</option>
<option value="Google">Google News</option>
<option value="ZDN">ZDNet News</option>
<option value="job">Job</option>
</select>
</form>
<br>
<div id="rssOutput">RSS-feed will be listed here...</div>
</body>
</html> 

**The Backend File **


<?php
//get the q parameter from URL
$q=$_GET["q"];

//find out which feed was selected
if($q=="Google") {
  $xml=("http://news.google.com/news?ned=us&topic=h&output=rss");
} elseif($q=="ZDN") {
  $xml=("https://www.zdnet.com/news/rss.xml");
}elseif($q == "job"){
  $xml=("https://ngcareers.com/feed");
}

$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);

//get elements from "<channel>"
$channel=$xmlDoc->getElementsByTagName('channel')->item(0);
$channel_title = $channel->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
$channel_link = $channel->getElementsByTagName('link')
->item(0)->childNodes->item(0)->nodeValue;
$channel_desc = $channel->getElementsByTagName('description')
->item(0)->childNodes->item(0)->nodeValue;

//output elements from "<channel>"
echo("<p><a href='" . $channel_link
  . "'>" . $channel_title . "</a>");
echo("<br>");
echo($channel_desc . "</p>");

//get and output "<item>" elements
$x=$xmlDoc->getElementsByTagName('item');

$count = $x->length;

// print_r( $x->item(0)->getElementsByTagName('title')->item(0)->nodeValue);
// print_r( $x->item(0)->getElementsByTagName('link')->item(0)->nodeValue);
// print_r( $x->item(0)->getElementsByTagName('description')->item(0)->nodeValue);
// return;

for ($i=0; $i <= $count; $i++) {
  //Title
  $item_title = $x->item(0)->getElementsByTagName('title')->item(0)->nodeValue;
  //Link
  $item_link = $x->item(0)->getElementsByTagName('link')->item(0)->nodeValue;
  //Description
  $item_desc = $x->item(0)->getElementsByTagName('description')->item(0)->nodeValue;
  //Category
  $item_cat = $x->item(0)->getElementsByTagName('category')->item(0)->nodeValue;


  echo ("<p>Title: <a href='" . $item_link
  . "'>" . $item_title . "</a>");
  echo ("<br>");
  echo ("Desc: ".$item_desc);
   echo ("<br>");
  echo ("Category: ".$item_cat . "</p>");
}
?> 

Adeleye Ayodeji
  • 325
  • 3
  • 5