
I am trying to parse YouTube's top 15 videos feed. An excerpt of the feed I am trying to parse looks like the following:

<entry>
    <title>The Title</title>
    <link href="http://example.com" />
    <media:thumbnail url="http://example.com/image.png" />
    <media:description>The Description</media:description>
    <media:statistics views="123456" />
    <pubDate>29/01/2017</pubDate>
</entry>

I am unable to capture any of the values that use the tags beginning with <media:. I am using the following code to parse the data; commented lines are those that don't work.

foreach ($xml->entry as $val) {
    echo "<item>".PHP_EOL;
    echo "<title>".$val->title."</title>".PHP_EOL;
    echo "<link>".$val->link["href"]."</link>".PHP_EOL;
    //echo "<image>".$val->media:thumbnail["url"]."</image>".PHP_EOL;
    //echo "<description>".$val->media:description."</description>".PHP_EOL;
    //echo "<views>".$val->media:statistics["views"]."</views>".PHP_EOL;
    echo "<pubDate>".$val->published."</pubDate>".PHP_EOL;
    echo "</item>".PHP_EOL;
}

How can I get the values of these tags without setting up namespaces? Doing a var_dump on $xml->entry doesn't even show the namespaced elements. Is there a better, built-in function for converting XML into arrays?

jpl42
  • Your XML is not well-formed (i.e., invalid). According to the [W3C Namespaces in XML 1.0](https://www.w3.org/TR/REC-xml-names/#ns-using): *the namespace prefix, unless it is xml or xmlns, MUST have been declared in a namespace declaration attribute*. So `media` prefix should be declared. – Parfait Jan 29 '17 at 22:01
  • A lot more difficult than with DOM+Xpath. Register own prefixes on the DOMXpath instance and use DOMXpath::evaluate() to fetch node lists and values. – ThW Jan 30 '17 at 10:09
  • I don't have time to write a full answer right now, but the method you're looking for is [`->children()`](http://php.net/manual/en/simplexmlelement.children.php). In your case `$val->children('media', true)->description` would work, although I'd recommend hard-coding the actual namespace URI (from the `xmlns:media` attribute) rather than the prefix, in case the source document is regenerated with different prefixes. – IMSoP Jan 30 '17 at 10:24
  • @ThW XPath doesn't seem like a good fit for this use case to me, and learning to use it and the DOM feels more complex than a few calls to `->children()` and `->attributes()`. – IMSoP Jan 30 '17 at 10:25
  • @Parfait It's an excerpt, not a full document; hence it's also missing the `<?xml ?>` declaration. The `xmlns:media` attribute will be at the unshown root of the document. That said, it would be great if it could be converted to a [mcve] with those parts added back in. – IMSoP Jan 30 '17 at 10:57
  • @IMSoP It is really simple: https://eval.in/726878 – ThW Jan 30 '17 at 14:13
  • @ThW Sure, easy enough if you already know XPath. For comparison, here's how I'd write it in SimpleXML: https://eval.in/726881 Personally, I find SimpleXML more readable in general, although there's not much in it in this case; but it's certainly not "a lot more difficult". The only fiddly bit is the `->attributes(null)`, because [unprefixed attributes are a bit of an anomaly](http://stackoverflow.com/a/10673325/157957). – IMSoP Jan 30 '17 at 14:25
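
A minimal sketch of the `->children()` / `->attributes()` approach described in the comments above, run against the question's simplified excerpt. The `<feed>` root and its two `xmlns` declarations are assumptions added here so the excerpt is well-formed; the namespace URIs are the ones used in the answers below. The `$items` array is purely illustrative.

```php
<?php
// Question's excerpt wrapped in an assumed <feed> root that declares the namespaces.
$feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
  <entry>
    <title>The Title</title>
    <link href="http://example.com" />
    <media:thumbnail url="http://example.com/image.png" />
    <media:description>The Description</media:description>
    <media:statistics views="123456" />
    <pubDate>29/01/2017</pubDate>
  </entry>
</feed>
XML;

// Match on the namespace URI rather than the "media" prefix,
// so the code keeps working if the feed changes its prefixes.
const NS_MEDIA = 'http://search.yahoo.com/mrss/';

$xml = new SimpleXMLElement($feed);
$items = [];

foreach ($xml->entry as $entry) {
    // Switch to the children that live in the media namespace.
    $media = $entry->children(NS_MEDIA);

    $items[] = [
        'title'       => (string) $entry->title,
        'link'        => (string) $entry->link['href'],
        // Unprefixed attributes belong to no namespace, so ask for them
        // explicitly with attributes() after selecting a namespaced element.
        'image'       => (string) $media->thumbnail->attributes()->url,
        'description' => (string) $media->description,
        'views'       => (string) $media->statistics->attributes()->views,
        'pubDate'     => (string) $entry->pubDate,
    ];
}

print_r($items);
```

Casting each node to a string yields plain values, which also covers the "XML into arrays" part of the question without relying on `var_dump` to reveal namespaced nodes.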

2 Answers


Got my answer from the code provided by IMSoP. The PHP snippet I ended up using was adapted from the aforementioned link, run against the real feed, whose structure is similar to the excerpt in the question:

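// Namespace URIs taken from the xmlns attributes on the feed's root element
define('NS_ATOM', 'http://www.w3.org/2005/Atom');
define('NS_MEDIA', 'http://search.yahoo.com/mrss/');
define('NS_YT', 'http://www.youtube.com/xml/schemas/2015');

foreach ($xml->children(NS_ATOM)->entry as $entry) {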
    echo "<item>".PHP_EOL;
    echo "<title>".$entry->title."</title>".PHP_EOL;
    echo "<link>".$entry->link->attributes(null)->href."</link>".PHP_EOL;
    echo "<image>".$entry->children(NS_MEDIA)->group->children(NS_MEDIA)->thumbnail->attributes(null)->url."</image>".PHP_EOL;
    echo "<description>".$entry->children(NS_MEDIA)->group->children(NS_MEDIA)->description."</description>".PHP_EOL;
    echo "<guid>".$entry->children(NS_YT)->videoId."</guid>".PHP_EOL;
    echo "<views>".$entry->children(NS_MEDIA)->group->children(NS_MEDIA)->community->children(NS_MEDIA)->statistics->attributes(null)->views."</views>".PHP_EOL;
    echo "<pubDate>".$entry->published."</pubDate>".PHP_EOL;
    echo "</item>".PHP_EOL;
}

Hope this can help somebody in the future. It was the easiest example of XML namespace parsing I've come across so far.

jpl42

Consider XSLT, the sibling to XPath, as you are essentially transforming the original XML rather than parsing select values. With XSLT, you need no foreach loop and can handle namespaces directly.

In fact, in the single timed runs shown below, XSLT is the fastest of the aforementioned methods (SimpleXML querying and XPath evaluating), using the posted XML wrapped in a <feed ...> root:

SimpleXML (from @IMSoP)

$time_start = microtime(true);

$xml = file_get_contents('YoutubeFeed.xml');
$document = new SimpleXMLElement($xml);
define('NS_ATOM', 'http://www.w3.org/2005/Atom');
define('NS_MEDIA', 'http://search.yahoo.com/mrss/');

foreach ($document->children(NS_ATOM)->entry as $entry) {
    echo "<item>".PHP_EOL;
    echo "<title>".$entry->title."</title>".PHP_EOL;
    echo "<link>".$entry->link->attributes(null)->href."</link>".PHP_EOL;
    echo "<image>".$entry->children(NS_MEDIA)->thumbnail->attributes()->url."</image>".PHP_EOL;
    echo "<description>".$entry->children(NS_MEDIA)->description."</description>".PHP_EOL;
    echo "<guid>".$entry->children(NS_MEDIA)->guid."</guid>".PHP_EOL;
    echo "<views>".$entry->children(NS_MEDIA)->statistics->attributes()->views."</views>".PHP_EOL;
    echo "<pubDate>".$entry->pubDate."</pubDate>".PHP_EOL; // matches <pubDate> in the posted excerpt
    echo "</item>".PHP_EOL;
}

Timing

echo "SimpleXML: " . (microtime(true) - $time_start) ."\n";
# SimpleXML: 0.0014688968658447

XPATH (from @ThW)

$time_start = microtime(true);

$xml = file_get_contents('YoutubeFeed.xml');
$document = new DOMDocument();
$document->loadXml($xml);

$xpath = new DOMXpath($document);
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
$xpath->registerNamespace('media', 'http://search.yahoo.com/mrss/');

foreach ($xpath->evaluate('//atom:entry') as $entry) {
   echo "<item>".PHP_EOL;
   echo "<title>". $xpath->evaluate('string(atom:title)', $entry)."</title>".PHP_EOL;
   echo "<link>". $xpath->evaluate('string(atom:link/@href)', $entry)."</link>".PHP_EOL;
   echo "<image>". $xpath->evaluate('string(media:thumbnail/@url)', $entry)."</image>".PHP_EOL;
   echo "<description>". $xpath->evaluate('string(media:description)', $entry)."</description>".PHP_EOL;
   echo "<guid>". $xpath->evaluate('string(media:guid)', $entry)."</guid>".PHP_EOL;
   echo "<views>".$xpath->evaluate('string(media:statistics/@views)', $entry)."</views>".PHP_EOL;
   echo "<pubDate>". $xpath->evaluate('string(atom:pubDate)', $entry)."</pubDate>".PHP_EOL;
   echo "</item>".PHP_EOL;
}

Timing

echo "XPATH: " . (microtime(true) - $time_start) ."\n";
# XPATH: 0.0012829303741455

XSLT

$time_start = microtime(true);

$xml = file_get_contents('YoutubeFeed.xml');
$document = new DOMDocument();
$document->loadXml($xml);

$xslstr = '<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
                xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"
                exclude-result-prefixes="atom media">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

   <xsl:template match="atom:feed">
    <xsl:apply-templates select="atom:entry"/>
   </xsl:template>

   <xsl:template match="atom:entry">
      <item>
         <title><xsl:value-of select="atom:title"/></title>
         <link><xsl:value-of select="atom:link/@href"/></link>
         <image><xsl:value-of select="media:thumbnail/@url"/></image>
         <description><xsl:value-of select="media:description"/></description>
         <guid><xsl:value-of select="media:guid"/></guid>
         <views><xsl:value-of select="media:statistics/@views"/></views>
         <pubDate><xsl:value-of select="atom:pubDate"/></pubDate>
      </item>
  </xsl:template>
</xsl:stylesheet>';

$xsl = new DOMDocument;
$xsl->loadXML($xslstr);

// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); 

// Transform XML source
$newXML = $proc->transformToXML($document);

// Echo string output
echo $newXML;

Timing

echo "XSLT: " . (microtime(true) - $time_start) ."\n";
# XSLT: 0.00098896026611328

Even with more <entry> nodes (the <entry> tag and its children copied until the file reached 500 lines), XSLT scales much better. The timings below are in seconds:

# SimpleXML: 0.62154388427734

# XPATH: 0.68382000923157

# XSLT: 0.011976957321167
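
Each figure above comes from a single run that also includes reading the file, so the numbers are indicative only. A hedged sketch of how the comparison could be steadied by averaging over repeated runs follows; `averageSeconds()` is a hypothetical helper, not part of any of the snippets above, and the SimpleXML workload is only a stand-in for the three methods.

```php
<?php
/**
 * Hypothetical helper: average wall-clock seconds per call of $fn
 * over $iterations runs, measured with the monotonic hrtime() clock.
 */
function averageSeconds(callable $fn, int $iterations = 100): float
{
    // Warm-up call so one-time costs are not attributed to the first timed run.
    $fn();

    $start = hrtime(true); // nanoseconds as an integer
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }

    return (hrtime(true) - $start) / 1e9 / $iterations;
}

// Stand-in workload: parse a small in-memory document with SimpleXML.
$xml = '<feed><entry><title>The Title</title></entry></feed>';

$seconds = averageSeconds(function () use ($xml): string {
    $doc = new SimpleXMLElement($xml);
    return (string) $doc->entry->title;
});

printf("SimpleXML: %.9f s per run\n", $seconds);
```

Loading the XML once outside the closure, and collecting output into a string instead of echoing it, keeps file I/O and output buffering out of the measurement.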
Parfait