I have an XML document that has a default namespace attached to it, e.g.

<foo xmlns="http://www.example.com/ns/1.0">
...
</foo>

In reality this is a complex XML document that conforms to a complex schema. My job is to parse out some data from it. To aid me, I have a spreadsheet of XPath expressions. The expressions are rather deeply nested, e.g.

level1/level2/level3[@foo="bar"]/level4[@foo="bar"]/level5/level6[2]

The person who generated the XPath is an expert in the schema, so I am going with the assumption that I can't simplify it or use object traversal shortcuts.

I am using SimpleXML to parse everything out. My problem has to do with how the default namespace gets handled.

Since there is a default namespace on the root element, I can't just do

$xml = simplexml_load_file($somepath);
$node = $xml->xpath('level1/level2/level3[@foo="bar"]/level4[@foo="bar"]/level5/level6[2]');

I have to register the namespace, assign it to a prefix, and then use the prefix in my XPath, e.g.

$xml = simplexml_load_file($somepath);
$xml->registerXPathNamespace('myns', 'http://www.example.com/ns/1.0');
$node = $xml->xpath('myns:level1/myns:level2/myns:level3[@foo="bar"]/myns:level4[@foo="bar"]/myns:level5/myns:level6[2]');

Adding the prefixes isn't going to be manageable in the long run.

Is there a proper way to handle default namespaces without needing to use prefixes with XPath?

Using an empty prefix doesn't work ($xml->registerXPathNamespace('', 'http://www.example.com/ns/1.0');). I can strip out the default namespace, e.g.

$xml = file_get_contents($somepath);
$xml = str_replace('xmlns="http://www.example.com/ns/1.0"', '', $xml);
$xml = simplexml_load_string($xml);

but that is skirting the issue.

mpdonadio
  • What do you mean by "adding the prefixes isn't going to be manageable in the long run"? Why is that? – JLRishe Jan 15 '14 at 17:18
  • @JLRishe I tried to simplify the question as much as possible. The XPath is currently in an XLS. We may end up automating the process, so the system will read the XLS and a directory of XML files, and then ingest all of the data mappings. I see adding the prefixes to the XPath via code as error prone. – mpdonadio Jan 15 '14 at 17:23
  • Can the process you use to produce the XLS be modified to have the XPaths include prefixes? – JLRishe Jan 15 '14 at 17:40
  • @JLRishe Again, another simplification. The XLS will be coming from third-party (with input from a fourth-party), and the XPath is already in their system. I don't see any part of that process changing, so my issue really does have to do with how SimpleXML and XPath work with default namespaces. – mpdonadio Jan 15 '14 at 17:48

3 Answers

From a bit of reading online, this limitation is not specific to PHP or any other library; it comes from XPath itself, at least in XPath version 1.0.

XPath 1.0 does not include any concept of a "default" namespace. Regardless of how the element names appear in the XML source, if they have a namespace bound to them, basic XPath selectors for them must be prefixed, in the form ns:name. Note that ns is a prefix defined within the XPath processor, not by the document being processed, so it has no relationship to how xmlns attributes are used in the XML representation.
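
As a rough illustration of that last point, here is a minimal sketch using SimpleXML. The sample document and the prefix zz are made up for this example; the only thing that has to match the document is the namespace URI.

```php
<?php
// Made-up sample document that uses a default namespace.
$source = <<<XML
<foo xmlns="http://www.example.com/ns/1.0">
  <level1><level2>hello</level2></level1>
</foo>
XML;

$xml = simplexml_load_string($source);

// Unprefixed names in XPath 1.0 mean "no namespace", so this finds nothing.
$plain = $xml->xpath('level1/level2');

// The prefix is arbitrary: "zz" never appears in the document.
// It only has to be bound to the same namespace URI.
$xml->registerXPathNamespace('zz', 'http://www.example.com/ns/1.0');
$prefixed = $xml->xpath('zz:level1/zz:level2');

// Treat any non-array result as "no match" to stay on the safe side.
echo count($plain ?: []), PHP_EOL;
echo (string) $prefixed[0], PHP_EOL;
```

The first query comes back empty and the second finds the level2 element, even though the document itself declares no prefix at all.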

See e.g. this "common XSLT mistakes" page, talking about the closely related XSLT 1.0:

To access namespaced elements in XPath, you must define a prefix for their namespace. [...] Unfortunately, XSLT version 1.0 has no concept similar to a default namespace; therefore, you must repeat namespace prefixes again and again.

According to an answer to a similar question, XPath 2.0 does include a notion of "default namespace", and the XSLT page linked above mentions this also in the context of XSLT 2.0.

Unfortunately, all of the built-in XML extensions in PHP are built on top of the libxml2 and libxslt libraries, which support only version 1.0 of XPath and XSLT.

So other than pre-processing the document not to use namespaces, your only option would be to find an XPath 2.0 processor that you could plug in to PHP.

(As an aside, it's worth noting that if you have unprefixed attributes in your XML document, they are not technically in the default namespace, but rather in no namespace at all; see XML Namespaces and Unprefixed Attributes for discussion of this oddity of the Namespace spec.)
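
A small sketch of what that means in practice, again with a made-up sample document: the element name needs the registered prefix, while the unprefixed attribute must stay unprefixed in the XPath.

```php
<?php
// Made-up sample: elements are in the default namespace,
// the "foo" attributes are in no namespace at all.
$source = <<<XML
<foo xmlns="http://www.example.com/ns/1.0">
  <level3 foo="bar">match</level3>
  <level3 foo="baz">other</level3>
</foo>
XML;

$xml = simplexml_load_string($source);
$xml->registerXPathNamespace('myns', 'http://www.example.com/ns/1.0');

// Prefix on the element, none on the attribute: this is the form that matches.
$hits = $xml->xpath('myns:level3[@foo="bar"]');

// Prefixing the attribute asks for a namespaced attribute, which is absent here.
$misses = $xml->xpath('myns:level3[@myns:foo="bar"]');

echo count($hits), ' ', count($misses ?: []), PHP_EOL;
```

So any code that adds prefixes to existing XPath expressions should leave attribute tests such as @foo alone.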

IMSoP

Is there a proper way to handle default namespaces without needing to use prefixes with XPath?

No. The proper way to handle any namespace is to associate some value (a prefix) with that namespace so that it can be explicitly selected in the XPath expression. The default namespace is no different.

Think about it this way: an element in some namespace and another element with the same name in some other namespace (or no namespace at all) are different elements. They could mean (i.e. represent) different things. That's the whole point. You need to tell XPath which one you want to select. Without it, XPath doesn't know what you're asking for.

Adding the prefixes isn't going to be manageable in the long run.

I really don't see why. Whatever creates the XPath expression should be capable of specifying a proper XPath expression (or it's a broken tool).

You might be thinking, "why can't I just ignore the namespace and get all elements matching that name?" There are really hacky ways to do this (like the XSLT-based answer already posted), but they are broken by design. An element in XML is identified by the combination of its namespace and local name, just as your house can be identified with a street number (the local name) in some city and state (the namespace). If I tell you that I live on 422 Main St, then you still have no idea where I live until I tell you which city and state.

You still might be thinking, "enough with the stupid analogies, I really, really want to do this anyway." You can select elements with a given name across all namespaces by matching only the local name portion of the element, like this:

*[local-name()='level1']/*[local-name()='level2']
    /*[local-name()='level3' and @foo="bar"]/*[local-name()='level4' and
        @foo="bar"]/*[local-name()='level5']/*[local-name()='level6'][2]

Note that this does not restrict itself to the default namespace. It ignores namespaces entirely. It's ugly and I don't recommend it, but sometimes you just want to ignore what's best and get something done.
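
To make the difference concrete, here is a minimal SimpleXML sketch. The sample document, the second namespace URI, and the x prefix are invented for illustration. The local-name() query needs no registered prefix, but it also picks up the element from the other namespace.

```php
<?php
// Made-up sample: two level2 elements with the same local name
// but in different namespaces.
$source = <<<XML
<foo xmlns="http://www.example.com/ns/1.0" xmlns:x="http://www.example.com/other">
  <level1>
    <level2>default namespace</level2>
    <x:level2>other namespace</x:level2>
  </level1>
</foo>
XML;

$xml = simplexml_load_string($source);

// No registerXPathNamespace() call is needed for local-name() matching,
// but the namespace is ignored, so both level2 elements are returned.
$byLocalName = $xml->xpath("*[local-name()='level1']/*[local-name()='level2']");

// The prefixed form only returns the element in the registered namespace.
$xml->registerXPathNamespace('myns', 'http://www.example.com/ns/1.0');
$byNamespace = $xml->xpath('myns:level1/myns:level2');

echo count($byLocalName), ' vs ', count($byNamespace), PHP_EOL;
```

If the documents never mix namespaces, the two queries return the same nodes; the sketch only shows where they diverge.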

By the way, this is not PHP's fault. This is what the XPath spec requires. You have to specify a prefix to select a node in a namespace. If PHP were to allow you to do it some other way, then whatever they called it, it would no longer be XPath (according to the spec).

Wayne
  • Thanks, I get the namespace analogies. I am mostly confused by the way PHP handles this. If there is a default namespace on a document, then I can use SimpleXML's object traversal to get to elements w/o explicitly giving a namespace or using the `$ns` parameter on various methods. However, if I want to use the `->xpath` method on the same document in SimpleXML, I need to register the namespace and assign it a prefix. – mpdonadio Jan 15 '14 at 21:02
  • It's not PHP's fault. This is what the XPath spec requires. You *have* to specify a prefix to select a node in a namespace. If PHP were to allow you to do it some other way, then whatever they called it, it would no longer be XPath (according to the spec). – Wayne Jan 15 '14 at 21:25
  • So what's the syntax to assign a prefix to a namespace that doesn't have a prefix? – ahnbizcad Apr 19 '17 at 19:17

In order to avoid hacks like the str_replace one you have there (and I would recommend avoiding that), you can run the XML files through an XSLT to strip out the namespace:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:myns="http://www.example.com/ns/1.0">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>

  <xsl:template match="myns:*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

When run on either of these inputs:

<foo xmlns="http://www.example.com/ns/1.0">
  <a>
    <child attr="5"></child>
  </a>
</foo>

<ex:foo xmlns:ex="http://www.example.com/ns/1.0">
  <ex:a>
    <ex:child attr="5"></ex:child>
  </ex:a>
</ex:foo>

The output is the same:

<foo>
  <a>
    <child attr="5" />
  </a>
</foo>

This would allow you to use your prefix-less XPaths on the result.
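
One possible way to wire this up from PHP is sketched below. It assumes the dom and xsl extensions are installed, and the variable names are only illustrative. The stylesheet is the one shown above, and the input is the first sample document.

```php
<?php
// Assumption: PHP's dom and xsl extensions are available.
// Nowdoc keeps the stylesheet text exactly as written above.
$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:myns="http://www.example.com/ns/1.0">
  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="myns:*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@* | node()" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xmlSource = <<<'XML'
<foo xmlns="http://www.example.com/ns/1.0">
  <a>
    <child attr="5"></child>
  </a>
</foo>
XML;

// Load the stylesheet and the input as DOM documents.
$xsl = new DOMDocument();
$xsl->loadXML($xslSource);
$input = new DOMDocument();
$input->loadXML($xmlSource);

// Run the transformation and hand the result to SimpleXML.
$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);
$stripped = simplexml_import_dom($processor->transformToDoc($input));

// The prefix-less XPath now matches, with no registerXPathNamespace() call.
$nodes = $stripped->xpath('a/child[@attr="5"]');
echo count($nodes), PHP_EOL;
```

Using transformToDoc() avoids serializing the result to a string and re-parsing it before querying.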

JLRishe
  • PHP's DOM API could give the same result in a couple of lines, if stripping the namespace (declaration and prefix) is all that is wanted. – salathe Jan 15 '14 at 22:44
  • @salathe If that's the case, then by all means, please enlighten us. – JLRishe Jan 16 '14 at 05:10