How to parse an XML node with a colon tag using PHP

Question

I am trying to fetch the values of the following elements from an XML feed (its URL is in the code below; it takes quite some time to load). The elements I'm interested in are:

title, g:price and g:gtin

The XML starts like this:

<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <title>PhotoSpecialist.de</title>
    <link>http://www.photospecialist.de</link>
    <description/>
    <item>
      <g:id>BEN107C</g:id>
      <title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>
      <description>
        Benbo Trekker Mk3 + Kugelkopf + Tasche Das Benbo Trekker Mk3 ist eine leichte Variante des beliebten Benbo 1. Sein geringes Gewicht macht das Trekker Mk3 zum idealen Stativ, wenn Sie viel draußen fotografieren und viel unterwegs sind. Sollten Sie in eine Situation kommen, in der maximale Stabilität zählt, verfügt das Benbo Trekker Mk3 über einen Haken an der Mittelsäule. An diesem können Sie das Stativ mit zusätzlichem Gewicht bei Bedarf beschweren. Dank der zwei besonderen Kamera-Befestigungsschrauben können Sie mit dem Benbo Trekker Mk3 sehr nah am Boden fotografieren. So nah, dass in vielen Fällen die einzige Einschränkung die Größe Ihrer Kamera darstellt. In diesem Set erhalten Sie das Benbo Trekker Mk3 zusammen mit einem Kugelkopf, Socket und einer Tasche für den sicheren und komfortablen Transport.
      </description>
      <link>
        http://www.photospecialist.de/benbo-trekker-mk3-kugelkopf-tasche?dfw_tracker=2469-16
      </link>
      <g:image_link>http://static.fotokonijnenberg.nl/media/catalog/product/b/e/benbo_trekker_mk3_tripod_kit_with_b__s_head__bag_ben107c1.jpg</g:image_link>
      <g:price>199.00 EUR</g:price>
      <g:condition>new</g:condition>
      <g:availability>in stock</g:availability>
      <g:identifier_exists>TRUE</g:identifier_exists>
      <g:brand>Benbo</g:brand>
      <g:gtin>5022361100576</g:gtin>
      <g:item_group_id>0</g:item_group_id>
      <g:product_type>Tripod</g:product_type>
      <g:mpn/>
      <g:google_product_category>Kameras &amp; Optik</g:google_product_category>
    </item>
  ...
  </channel>
</rss>

To get this, I have written the following code:

$z = new XMLReader;
$z->open('https://my.datafeedwatch.com/static/files/1248/8222ebd3847fbfdc119abc9ba9d562b2cdb95818.xml');

$doc = new DOMDocument;

while ($z->read() && $z->name !== 'item')
    ;

while ($z->name === 'item')
{
    $node = new SimpleXMLElement($z->readOuterXML());
    $a = $node->title;
    $b = $node->price;
    $c = $node->gtin;
    echo $a . $b . $c . "<br />";
    $z->next('item');
}

This prints only the title; price and gtin come out empty.

My bad, you're using [**SimpleXMLElement** to access the attributes with their own namespace](http://stackoverflow.com/q/6576773/367456). So the linked duplicate is not entirely correct (you could just use [`XMLReader::expand()`](https://php.net/manual/en/xmlreader.expand.php) to obtain the **DOMElement** directly, convert to DOM via `dom_import_simplexml` or for sure access the namespaced attributes via SimpleXML directly like in the linked Q&A in this comment). — hakre, Apr 26 '15 at 11:27
@hakre...i can't use simplexml as the XML is large so XMLReader is to be used — user3305327, Apr 26 '15 at 11:30
Huh? You actually use SimpleXML in your questions code. I was not speaking about switching away from **XMLReader** when I mentioned it. — hakre, Apr 26 '15 at 11:42
@hakre...oops sorry...actually am very new to this XML coding...btw can you please help me with this problem — user3305327, Apr 26 '15 at 11:50

score 12 · Accepted Answer · answered Apr 26 '15 at 12:17


The elements you're asking about are not part of the default namespace but in a different one. You can tell because their names carry a prefix, separated from the local name by a colon:

  ...
  <channel>
    <title>PhotoSpecialist.de</title>
    <!-- title is in the default namespace, no colon in the name -->
    ...
    <g:price>199.00 EUR</g:price>
    ...
    <g:gtin>5022361100576</g:gtin>
    <!-- price and gtin are in a different namespace, colon in the name and prefixed by "g" -->
  ...

The namespace is referenced through a prefix, "g" in your case. The namespace that this prefix stands for is declared on the document element:

<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">

So the namespace is "http://base.google.com/ns/1.0".
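To check which prefix maps to which namespace URI, SimpleXML can list the declarations on the document element. The following is a minimal sketch; the inline XML string is a shortened stand-in for the feed (an assumption for illustration), not the real document:

```php
<?php
// Shortened stand-in for the feed's document element (assumed for illustration).
$xml = '<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0"><channel/></rss>';

$rss = simplexml_load_string($xml);

// Maps each prefix declared on the root element to its namespace URI.
$namespaces = $rss->getDocNamespaces();

foreach ($namespaces as $prefix => $uri) {
    echo $prefix, ' => ', $uri, "\n";
}
```

The prefix is only an alias chosen by the document author; the URI is what actually identifies the namespace.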

When you access the child-elements by their name with the SimpleXMLElement as you currently do:

$a = $node->title;
$b = $node->price;
$c = $node->gtin;

you're looking only in the default namespace. So only the first element actually contains text; the other two are created on the fly and are still empty.
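The effect can be reproduced on a single item. This is a small sketch that assumes a trimmed-down copy of one item element from the question as inline XML:

```php
<?php
// Trimmed-down copy of one item from the question (assumed for illustration).
$item = new SimpleXMLElement(
    '<item xmlns:g="http://base.google.com/ns/1.0">'
    . '<title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>'
    . '<g:price>199.00 EUR</g:price>'
    . '<g:gtin>5022361100576</g:gtin>'
    . '</item>'
);

// Plain property access only matches elements without the "g" namespace.
$titleCount = count($item->title);   // one matching element
$priceCount = count($item->price);   // no match: g:price is in another namespace
$priceText  = (string) $item->price; // empty string

printf("title: %d, price: %d, price text: '%s'\n", $titleCount, $priceCount, $priceText);
```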

To access the namespaced child elements, you need to tell the SimpleXMLElement explicitly by calling the children() method with the namespace URI. It returns a new SimpleXMLElement that gives access to the children in that namespace instead of the default one:

$google = $node->children("http://base.google.com/ns/1.0");

$a = $node->title;
$b = $google->price;
$c = $google->gtin;
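As a self-contained sketch of the same idea, the snippet below runs children() against a trimmed-down inline item (an assumption for illustration). It also shows the prefix-based form, children('g', true), which works only as long as the document keeps using the prefix "g":

```php
<?php
const GOOGLE_NS = 'http://base.google.com/ns/1.0';

// Trimmed-down copy of one item from the question (assumed for illustration).
$item = new SimpleXMLElement(
    '<item xmlns:g="http://base.google.com/ns/1.0">'
    . '<title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>'
    . '<g:price>199.00 EUR</g:price>'
    . '<g:gtin>5022361100576</g:gtin>'
    . '</item>'
);

// Select the children by namespace URI (independent of the prefix used).
$byUri = $item->children(GOOGLE_NS);

// Select the children by prefix; the second argument marks 'g' as a prefix.
$byPrefix = $item->children('g', true);

$title = (string) $item->title;
$price = (string) $byUri->price;
$gtin  = (string) $byPrefix->gtin;

echo $title, ' - ', $price, ' - ', $gtin, "\n";
```

Selecting by URI is usually the more robust choice, because a feed may change its prefix without changing the namespace.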

So much for the isolated example (yes, that's it already).

A full example could then look like this (including node expansion on the reader; the code you had was a bit rusty):

<?php
/**
 * How to parse an XML node with a colon tag using PHP
 *
 * @link http://stackoverflow.com/q/29876898/367456
 */
const HTTP_BASE_GOOGLE_COM_NS_1_0 = "http://base.google.com/ns/1.0";

$url = 'https://my.datafeedwatch.com/static/files/1248/8222ebd3847fbfdc119abc9ba9d562b2cdb95818.xml';

$reader = new XMLReader;
$reader->open($url);

// document that owns the nodes expanded from the reader
$doc = new DOMDocument;

// move to first item element
while (($valid = $reader->read()) && $reader->name !== 'item') ;

while ($valid) {
    // expand the current item into a DOM node and wrap it for SimpleXML access
    $default    = simplexml_import_dom($reader->expand($doc));
    // children in the Google Base namespace (g:price, g:gtin, ...)
    $googleBase = $default->children(HTTP_BASE_GOOGLE_COM_NS_1_0);
    printf(
        "%s - %s - %s<br />\n"
        , htmlspecialchars((string) $default->title)
        , htmlspecialchars((string) $googleBase->price)
        , htmlspecialchars((string) $googleBase->gtin)
    );

    // move to next item element, skipping the current item's subtree
    $valid = $reader->next('item');
}

$reader->close();

I hope this gives both an explanation and a slightly broader view of how XMLReader can be used.
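Instead of importing each expanded node into SimpleXML, the node can also be queried with DOMXPath after registering the namespace under a prefix. The following is a sketch under assumptions: the inline XML is a shortened stand-in for the feed, the second item is made up, and the prefix "g" registered on the XPath object is a local choice that only has to match the expressions used here:

```php
<?php
const GOOGLE_NS = 'http://base.google.com/ns/1.0';

// Shortened stand-in for the feed; the second item is invented for illustration.
$xml = <<<'XML'
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <title>PhotoSpecialist.de</title>
    <item>
      <title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>
      <g:price>199.00 EUR</g:price>
      <g:gtin>5022361100576</g:gtin>
    </item>
    <item>
      <title>Second item</title>
      <g:price>10.00 EUR</g:price>
      <g:gtin>0000000000000</g:gtin>
    </item>
  </channel>
</rss>
XML;

$reader = new XMLReader();
$reader->XML($xml);

// The XPath object works on the document that owns the expanded nodes.
$doc   = new DOMDocument();
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('g', GOOGLE_NS);

$rows = [];

while (($valid = $reader->read()) && $reader->name !== 'item') {
    // keep reading until the first item element
}

while ($valid) {
    // Expand only the current item into a DOM node owned by $doc.
    $node = $reader->expand($doc);

    $rows[] = [
        $xpath->evaluate('string(title)', $node),
        $xpath->evaluate('string(g:price)', $node),
        $xpath->evaluate('string(g:gtin)', $node),
    ];

    // Jump to the next item, skipping the current item's subtree.
    $valid = $reader->next('item');
}

$reader->close();

foreach ($rows as $row) {
    echo implode(' - ', $row), "\n";
}
```

Only one item is held in memory at a time, so the approach still suits large feeds.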

answered Apr 26 '15 at 12:17

hakre


@hakre..thanks for such nice informative post...its a tutorial for me thanks once again – user3305327 Apr 26 '15 at 17:19

An even better variant might be to use DOMXPath, but I remembered this too late :) ThW had such an example with **XMLReader**; I'll see if I can find a link. ***Edit:*** here it is, the example fits really nicely: http://stackoverflow.com/a/23079179/367456 – hakre Apr 26 '15 at 20:22
"only the first element actually contains text, the other two are created on the fly and are still empty" - that's not really true; *all* the child elements or attributes are *retrieved* on-demand ([here](http://lxr.php.net/xref/PHP_TRUNK/ext/simplexml/simplexml.c#246), ultimately), it's just that the call to `->children($ns)` or `->attributes($ns)` tells SimpleXML *which* ones to retrieve. I find SimpleXML feels less surprising if you think of it as an *API*, like the DOM but simpler, rather than as objects which "contain" data. – IMSoP Apr 27 '15 at 19:13
@IMSoP; I like your description (I've read some of your recent answers in the SimpleXML tag, really *very* well written, makes me a bit jealous but hopefully my English profits from reading) but some of those elements are also created when accessed, at least when you write data into them: https://eval.in/319535 - that's what I meant with create on the fly. The original document didn't contain that element (this is for `$b` and `$c` in my answer above). – hakre Apr 27 '15 at 19:52
@hakre Ah, I think I see what you mean, but they won't be created just by reading them: https://eval.in/319537 Since the question is only about reading, the fact that you *could* create them by assigning a value is kind of by-the-by. Still, an interesting point that referencing them isn't *invalid*, just *not useful for the current task*. :) – IMSoP Apr 27 '15 at 20:01
@IMSoP: Actually, following the lxr link you gave shows those are created if they do not exist yet. SimpleXMLElement returns an object here (hence the API), so I must say that in the end, "created on the fly by the API" isn't that wrong, is it? ^^ :) http://lxr.php.net/xref/PHP_TRUNK/ext/simplexml/simplexml.c#358 – hakre Apr 27 '15 at 20:58
@hakre My reading was that a new *object* is created every time the property access handler is used, regardless of node (element/attribute) existence - so even elements returned fine are "created on the fly". I'm not entirely clear how that works for non-existent nodes (both because my C skills are rudimentary, and because of the shocking scarcity of comments in the source), but it clearly doesn't imply any changes to the underlying document structure, since the XML output is not changed. – IMSoP Apr 27 '15 at 21:18
@IMSoP: I don't know 100% from C code but in userspace: Non-existent nodes are empty elements. It's not possible to create empty elements on empty elements. That is, the parent element must exist, the child is added on the fly. On an on-the-fly added child you can't add another child (let's try that: https://eval.in/319618 yes, like that). – hakre Apr 27 '15 at 21:44
And if you want, create a question :) we can garden it for reference. Let's keep it in Q&A form here on SO, no need to bury this into comments :) – hakre Apr 27 '15 at 21:46

score 0 · Answer 2 · answered Feb 10 '21 at 07:56


If the element you iterate over has a colon in its own name (a prefixed name, for example g:item), you must use

$xml->next($xml->localName);

to move to the next item element.
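A minimal sketch of that case follows. The g:feed and g:item element names are hypothetical and not taken from the question's feed; they only illustrate a document whose repeated element is itself prefixed:

```php
<?php
// Hypothetical document in which the repeated element itself carries a prefix.
$xml = <<<'XML'
<g:feed xmlns:g="http://base.google.com/ns/1.0">
  <g:item><g:gtin>5022361100576</g:gtin></g:item>
  <g:item><g:gtin>0000000000000</g:gtin></g:item>
</g:feed>
XML;

$reader = new XMLReader();
$reader->XML($xml);

while (($valid = $reader->read()) && $reader->localName !== 'item') {
    // keep reading until the first element with local name "item"
}

$names = [];

while ($valid) {
    // name holds the qualified name (with prefix), localName the part after the colon.
    $names[] = $reader->name;

    // Pass the local name, not the qualified name, to next().
    $valid = $reader->next($reader->localName);
}

$reader->close();

echo implode(', ', $names), "\n";
```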

answered Feb 10 '21 at 07:56

revoke

