
I have one solution to the subject problem, but it’s a hack and I’m wondering if there’s a better way to do this.

Below is a sample XML file and a PHP CLI script that executes an xpath query given as an argument. For this test case, the command line is:

./xpeg "//MainType[@ID=123]"

What seems most strange is this line, without which my approach doesn’t work:

$domdoc->loadXML($domdoc->saveXML($domdoc));

As far as I know, this simply re-parses the modified XML, and it seems to me that this shouldn’t be necessary.

Is there a better way to perform xpath queries on this XML in PHP?


XML (note the binding of the default namespace):

<?xml version="1.0" encoding="utf-8"?>
<MyRoot
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.example.com/data http://www.example.com/data/MyRoot.xsd"
 xmlns="http://www.example.com/data">
  <MainType ID="192" comment="Bob's site">
    <Price>$0.20</Price>
    <TheUrl><![CDATA[http://www.example.com/path1/]]></TheUrl>
    <Validated>N</Validated>
  </MainType>
  <MainType ID="123" comment="Test site">
    <Price>$99.95</Price>
    <TheUrl><![CDATA[http://www.example.com/path2]]></TheUrl>
    <Validated>N</Validated>
  </MainType>
  <MainType ID="922" comment="Health Insurance">
    <Price>$600.00</Price>
    <TheUrl><![CDATA[http://www.example.com/eg/xyz.php]]></TheUrl>
    <Validated>N</Validated>
  </MainType>
  <MainType ID="389" comment="Used Cars">
    <Price>$5000.00</Price>
    <TheUrl><![CDATA[http://www.example.com/tata.php]]></TheUrl>
    <Validated>N</Validated>
  </MainType>
</MyRoot>

PHP CLI Script:

#!/usr/bin/php-cli
<?php

$xml = file_get_contents("xpeg.xml");

$domdoc = new DOMDocument();
$domdoc->loadXML($xml);

// remove the default namespace binding
$e = $domdoc->documentElement;
$e->removeAttributeNS($e->getAttributeNode("xmlns")->nodeValue,"");

// hack hack, cough cough, hack hack
$domdoc->loadXML($domdoc->saveXML($domdoc));

$xpath = new DOMXpath($domdoc);

$str = trim($argv[1]);
$result = $xpath->query($str);
if ($result !== FALSE) {
  dump_dom_levels($result);
}
else {
  echo "error\n";
}

// The following function isn't really part of the
// question. It simply provides a concise summary of
// the result.
function dump_dom_levels($node, $level = 0) {
  $class = get_class($node);
  if ($class == "DOMNodeList") {
    echo "Level $level ($class): $node->length items\n";
    foreach ($node as $child_node) {
      dump_dom_levels($child_node, $level+1);
    }
  }
  else {
    $nChildren = 0;
    foreach ($node->childNodes as $child_node) {
      if ($child_node->hasChildNodes()) {
        $nChildren++;
      }
    }
    if ($nChildren) {
      echo "Level $level ($class): $nChildren children\n";
    }
    foreach ($node->childNodes as $child_node) {
      if ($child_node->hasChildNodes()) {
        dump_dom_levels($child_node, $level+1);
      }
    }
  }
}
?>
danorton
  • I have adjusted this question to remove the nonsense in my original query. The solution below by Tomalak is on-point, but it requires complicating the queries by rewriting all the names. The underlying problem is that DOMXpath (and XPath 1.0) does not provide support for a default namespace. A secondary issue might be with PHP, as the code does behave differently after removing the attribute but before re-scanning. – danorton Jun 25 '11 at 06:38

4 Answers


The solution is using the namespace, not getting rid of it.

$result = new DOMDocument();
$result->loadXML($xml);

$xpath = new DOMXpath($result);
$xpath->registerNamespace("x", trim($argv[2]));

$str = trim($argv[1]);
$result = $xpath->query($str);

And call it like this on the command line (note the x: in the XPath expression):

./xpeg "//x:MainType[@ID=123]" "http://www.example.com/data"

You can make this more shiny by

  • finding out default namespaces yourself (by looking at the namespaceURI property of the document element)
  • supporting more than one namespace on the command line and registering them all before $xpath->query()
  • supporting arguments in the form of xyz=http://namespace.uri/ to create custom namespace prefixes
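
The suggestions above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name make_xpath, the fixed prefix "x" for the detected namespace, and the "d=..." binding are all made up for the example, and reading namespaceURI from the document element only yields the default namespace when the root element itself is unprefixed.

```php
<?php
// Hypothetical helper (not part of the answer above): builds a DOMXPath with
// the root element's namespace bound to "x", plus any "prefix=uri" extras.
function make_xpath(DOMDocument $doc, array $bindings = []): DOMXPath
{
    $xpath = new DOMXPath($doc);

    // namespaceURI is null when the root element is in no namespace.
    $rootNs = $doc->documentElement->namespaceURI;
    if ($rootNs !== null && $rootNs !== '') {
        $xpath->registerNamespace('x', $rootNs);
    }

    foreach ($bindings as $binding) {
        // Split on the first "=" only, so URIs that contain "=" survive.
        $parts = explode('=', $binding, 2);
        if (count($parts) === 2 && $parts[0] !== '') {
            $xpath->registerNamespace($parts[0], $parts[1]);
        }
    }

    return $xpath;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<MyRoot xmlns="http://www.example.com/data"><MainType ID="123"/></MyRoot>'
);

// Stand-in for extra arguments that would come from $argv in the CLI script.
$xpath = make_xpath($doc, ['d=http://www.example.com/data']);

// Both prefixes point at the same namespace URI, so both queries match.
$viaDetected = $xpath->query('//x:MainType[@ID=123]')->length;
$viaExplicit = $xpath->query('//d:MainType[@ID=123]')->length;

echo "x: $viaDetected, d: $viaExplicit\n";
```

The prefix used in the query does not have to match anything in the document; only the namespace URI it is bound to matters.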

Bottom line is: In XPath you can't query //foo when you really mean //namespace:foo. These are fundamentally different and therefore select different nodes. The fact that XML can have a default namespace defined (and thus can drop explicit namespace usage in the document) does not mean you can drop namespace usage in XPath.
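
To make the difference concrete, here is a small self-contained check. It uses a trimmed inline copy of the question's XML instead of xpeg.xml, and the prefix "x" is an arbitrary choice; treat it as a sketch rather than a drop-in replacement for the script.

```php
<?php
// Inline stand-in for the question's xpeg.xml, trimmed to two records.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<MyRoot xmlns="http://www.example.com/data">
  <MainType ID="192" comment="Bob's site"><Price>$0.20</Price></MainType>
  <MainType ID="123" comment="Test site"><Price>$99.95</Price></MainType>
</MyRoot>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// "x" is an arbitrary prefix; only the URI has to match the document.
$xpath->registerNamespace('x', 'http://www.example.com/data');

// Unprefixed name: selects only elements that are in no namespace.
$unprefixed = $xpath->query('//MainType[@ID=123]');

// Prefixed name: selects elements in the registered namespace.
$prefixed = $xpath->query('//x:MainType[@ID=123]');

echo "unprefixed matches: {$unprefixed->length}\n";
echo "prefixed matches: {$prefixed->length}\n";
foreach ($prefixed as $node) {
    echo $node->getAttribute('comment'), "\n";
}
```

Note that the attribute test stays unprefixed: an attribute without a prefix is in no namespace, even when its element is in the default namespace.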

Tomalak
  • @danorton - your document shows `xmlns="http://www.example.com/data"` as the default namespace - is this not supposed to be there? I agree with @Tomalak (but have not been able to test to verify): the namespace is in the original document, and all the elements are bound to that namespace when the document is originally parsed. Removing the namespace attribute doesn't remove this binding, but it does make the save and re-load hack work by ensuring that the elements are not bound when the document is re-parsed. – cordsen Jun 25 '11 at 03:40
  • I stand corrected @Tomalak. After some more homework, I'm now confused as to why I need to specify a namespace prefix at all. It seems that DOMXpath will only work with the default namespace if it is unbound and your solution binds an additional prefix to the same namespace name. – danorton Jun 25 '11 at 05:13
  • @danorton: Read the last paragraph in my answer ("Bottom line is:") ;) – Tomalak Jun 25 '11 at 05:17
  • Okay, so it's not DOMXpath specifically, but XPath 1.0 has no understanding of a default namespace: instead of translating an unprefixed name to the default namespace, it translates the absence of a prefix as the absence of a namespace. It looks like XPath 2.0 will correct this discrepancy between XML and XPath. – danorton Jun 25 '11 at 06:27
  • @danorton: Yes, that's exactly the way it is. :) – Tomalak Jun 25 '11 at 06:30

Just out of curiosity, what happens if you remove this line?

$e->removeAttributeNS($e->getAttributeNode("xmlns")->nodeValue,"");

That strikes me as the most likely to cause the need for your hack. You're basically removing the xmlns="http://www.example.com/data" part and then re-building the DOMDocument. Have you considered simply using string functions to remove that namespace?

$pieces = explode('xmlns="', $xml);
$xml = $pieces[0] . substr($pieces[1], strpos($pieces[1], '"') + 1);

Then continue on your way? It might even end up being faster.

cwallenpoole
  • Uhm, that is an even uglier hack than what the OP originally did. This is on par with using regex for XML and you really should not recommend such a thing. – Tomalak Jun 25 '11 at 03:05
  • Removing that line removes the fix and the query fails. Personally, I feel that parsing XML using string functions is a bigger hack than telling the library to re-scan. Thanks, all the same. – danorton Jun 25 '11 at 03:05
  • @danorton While I admit that Tomalak's is better, I have to believe parsing a string before creation of XML is far better than creating XML, manipulating it, converting it to a string, and then re-parsing it. I'll definitely take this score without complaint, but considering the number of objects being created, destroyed, and re-created by passing an object through a string, I think that my solution is not quite so rank. – cwallenpoole Jun 25 '11 at 04:55
  • @cwallenpoole The number of DOM objects created etc. hardly matters for a command line application that spawns an entire PHP process to do an XPath query. Above all, it does not justify doing string manipulation on XML - this should be taboo under any circumstances. – Tomalak Jun 25 '11 at 05:12
  • The command line application is only to easily reproduce the test case. If performance becomes a critical issue in the application, I might feel justified in trading maintainability for performance, but that's probably a less typical situation. – danorton Jun 25 '11 at 06:43
  • @Tomalak Good point vis-a-vis this being a command line. Premature optimization on my part, I guess. That said, I can't help but feel that pushing an object through a string is just as bad as manipulating the string for XML before parsing it. – cwallenpoole Jun 25 '11 at 14:11
  • @danorton Well, it really shouldn't matter. Tomalak's answer is a solid one. If I were in your circumstances, I would definitely go with it. I'd also take his suggestion about detecting the default namespace of the DOMDocument – cwallenpoole Jun 25 '11 at 14:13
  • @cwallenpoole: Serializing an XML DOM to string and re-parsing it may be less efficient than manipulating the source and only parsing it once. Then again, you just don't do string manipulation on XML. That is not "just as worse", it's out of the question. The world would be a happier place if nobody did things like this. ;) – Tomalak Jun 25 '11 at 14:26
  • I needed the namespace in one place, but a fragment couldn't have it. I wasn't able to get rid of the namespace with registering and new docs. String functions were needed, because saveXML would put the namespace back. – VectorVortec Nov 25 '17 at 01:49

As a variant, you can also use an XPath expression that matches on the local name only and ignores the namespace:

//*[local-name(.) = 'MainType'][@ID='123']
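
A minimal sketch of running that expression through DOMXPath, using a trimmed inline document as a stand-in for the question's xpeg.xml:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<MyRoot xmlns="http://www.example.com/data">'
    . '<MainType ID="192"/><MainType ID="123" comment="Test site"/>'
    . '</MyRoot>'
);

// No registerNamespace() call is needed: local-name() compares only the
// part of the element name after any prefix, whatever its namespace.
$xpath = new DOMXPath($doc);
$matches = $xpath->query("//*[local-name(.) = 'MainType'][@ID='123']");

echo $matches->length, "\n";
echo $matches->item(0)->getAttribute('comment'), "\n";
```

The trade-off is that this selects MainType elements from any namespace (or none), so it is less precise than binding a prefix when a document mixes vocabularies.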
Tertium

Given the current state of the XPath language, I feel that the best answer is provided by Tomalak: to associate a prefix with the default namespace and to prefix all tag names. That’s the solution I intend to use in my current application.

When that’s not possible or practical, a better solution than my hack is to invoke a method that does the same thing as re-scanning (hopefully more efficiently): DOMDocument::normalizeDocument(). The method behaves “as if you saved and then loaded the document, putting the document in a ‘normal’ form.”

danorton