50

How does XPath deal with XML namespaces?

If I use

/IntuitResponse/QueryResponse/Bill/Id

to parse the XML document below I get 0 nodes back.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<IntuitResponse xmlns="http://schema.intuit.com/finance/v3" 
                time="2016-10-14T10:48:39.109-07:00">
    <QueryResponse startPosition="1" maxResults="79" totalCount="79">
        <Bill domain="QBO" sparse="false">
            <Id>=1</Id>
        </Bill>
    </QueryResponse>
</IntuitResponse>

However, I'm not specifying the namespace in the XPath (i.e. http://schema.intuit.com/finance/v3 is not a prefix of each token of the path). How can XPath know which Id I want if I don't tell it explicitly? I suppose in this case (since there is only one namespace) XPath could get away with ignoring the xmlns entirely. But if there are multiple namespaces, things could get ugly.

kjhughes
Adam
  • Your XPath should not return any node : [INFO - XPath returned 0 items (compiled in 0ms, evaluated in 1ms)](http://www.xpathtester.com/xpath/db27cc0f978057b5faabc32c6aa149c8). How did you execute the XPath? – har07 Nov 25 '16 at 00:57
  • @har07 I did it in Java using import `javax.xml.xpath.XPath`. I agree it doesn't work using an online tester. That was one of the perplexing things. – Adam Nov 25 '16 at 01:16
  • Excellent question! XPath itself provides no way to specify a default namespace or the binding of a namespace prefix to a namespace. Fortunately, however, hosting languages and libraries do. [**See my answer below for details**](http://stackoverflow.com/a/40796315/290085)... – kjhughes Nov 25 '16 at 01:17
  • Not quite sure why a question should be upvoted so highly when it has been asked and answered 1000 times before.... – Michael Kay Nov 25 '16 at 08:52
  • I for one was impressed with this question because, unlike most previous askers, Adam not only included a [mcve], he sensed and conveyed the need for XPath to deal with XML namespaces *somehow*. Most such questions merely post an XPath, maybe some XML (and if we're lucky it's not an image or a link to a humongous off-site resource), and state that it "doesn't work." Adam sensed it had to do with namespaces, nailed the title, and wrote what I considered to be a question worthy of a canonical answer. – kjhughes Nov 25 '16 at 15:59
  • Possible duplicate of [How to query XML using namespaces in Java with XPath?](https://stackoverflow.com/questions/6390339/how-to-query-xml-using-namespaces-in-java-with-xpath) – rogerdpack Feb 06 '19 at 23:33

2 Answers

79

XPath 1.0/2.0

Defining namespaces in XPath (recommended)

XPath itself doesn't have a way to bind a namespace prefix to a namespace. Such facilities are provided by the hosting library.

It is recommended that you use those facilities and define namespace prefixes that can then be used to qualify XML element and attribute names as necessary.


Here are some of the various mechanisms which XPath hosts provide for specifying namespace prefix bindings to namespace URIs.

(OP's original XPath, /IntuitResponse/QueryResponse/Bill/Id, has been shortened to /IntuitResponse/QueryResponse.)

C#:

XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
nsmgr.AddNamespace("i", "http://schema.intuit.com/finance/v3");
XmlNodeList nodes = el.SelectNodes(@"/i:IntuitResponse/i:QueryResponse", nsmgr);

Google Docs:

Unfortunately, IMPORTXML() does not provide a namespace prefix binding mechanism. See next section, Defeating namespaces in XPath, for how to use local-name() as a work-around.

Java (SAX):

NamespaceSupport support = new NamespaceSupport();
support.pushContext();
support.declarePrefix("i", "http://schema.intuit.com/finance/v3");

Java (XPath):

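xpath.setNamespaceContext(new NamespaceContext() {
    public String getNamespaceURI(String prefix) {
      switch (prefix) {
        case "i": return "http://schema.intuit.com/finance/v3";
        // ...
        default: return javax.xml.XMLConstants.NULL_NS_URI;  // unbound prefix
      }
    }
    // The interface also requires these two; XPath evaluation typically only calls getNamespaceURI().
    public String getPrefix(String namespaceURI) { return null; }
    public java.util.Iterator<String> getPrefixes(String namespaceURI) { return null; }
});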

JavaScript:

See Implementing a User Defined Namespace Resolver:

function nsResolver(prefix) {
  var ns = {
    'i' : 'http://schema.intuit.com/finance/v3'
  };
  return ns[prefix] || null;
}
document.evaluate( '/i:IntuitResponse/i:QueryResponse', 
                   document, nsResolver, XPathResult.ANY_TYPE, 
                   null );

Note that if the default namespace has an associated namespace prefix defined, using the nsResolver() returned by Document.createNSResolver() can obviate the need for a custom nsResolver().

Perl (LibXML):

my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs('i', 'http://schema.intuit.com/finance/v3');
my @nodes = $xc->findnodes('/i:IntuitResponse/i:QueryResponse');

Python (lxml):

from io import StringIO
from lxml import etree
f = StringIO('<IntuitResponse>...</IntuitResponse>')
doc = etree.parse(f)
r = doc.xpath('/i:IntuitResponse/i:QueryResponse', 
              namespaces={'i':'http://schema.intuit.com/finance/v3'})

Python (ElementTree):

namespaces = {'i': 'http://schema.intuit.com/finance/v3'}
# root is the i:IntuitResponse element; ElementTree rejects absolute paths on an element
root.findall('i:QueryResponse', namespaces)

Python (Scrapy):

response.selector.register_namespace('i', 'http://schema.intuit.com/finance/v3')
response.xpath('/i:IntuitResponse/i:QueryResponse').getall()

PHP:

Adapted from @Tomalak's answer using DOMDocument:

$result = new DOMDocument();
$result->loadXML($xml);

$xpath = new DOMXPath($result);
$xpath->registerNamespace("i", "http://schema.intuit.com/finance/v3");

$result = $xpath->query("/i:IntuitResponse/i:QueryResponse");

See also @IMSoP's canonical Q/A on PHP SimpleXML namespaces.

Ruby (Nokogiri):

puts doc.xpath('/i:IntuitResponse/i:QueryResponse',
                'i' => "http://schema.intuit.com/finance/v3")

Note that Nokogiri supports removal of namespaces,

doc.remove_namespaces!

but see the warnings below that discourage defeating XML namespaces.

VBA:

xmlNS = "xmlns:i='http://schema.intuit.com/finance/v3'"
doc.setProperty "SelectionNamespaces", xmlNS  
Set queryResponseElement = doc.SelectSingleNode("/i:IntuitResponse/i:QueryResponse")

VB.NET:

xmlDoc = New XmlDocument()
xmlDoc.Load("file.xml")
nsmgr = New XmlNamespaceManager(xmlDoc.NameTable)
nsmgr.AddNamespace("i", "http://schema.intuit.com/finance/v3")
nodes = xmlDoc.DocumentElement.SelectNodes("/i:IntuitResponse/i:QueryResponse",
                                           nsmgr)

SoapUI (doc):

declare namespace i='http://schema.intuit.com/finance/v3';
/i:IntuitResponse/i:QueryResponse

xmlstarlet:

-N i="http://schema.intuit.com/finance/v3"

XSLT:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:i="http://schema.intuit.com/finance/v3">
   ...

Once you've declared a namespace prefix, your XPath can be written to use it:

/i:IntuitResponse/i:QueryResponse
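
As a minimal, runnable sketch of the difference a prefix binding makes, the following uses Python's standard-library ElementTree on the OP's document (XML declaration omitted). The prefix i and the variable names are arbitrary choices, and the paths are relative because ElementTree evaluates them from the root element rather than from the document node:

```python
import xml.etree.ElementTree as ET

XML = """<IntuitResponse xmlns="http://schema.intuit.com/finance/v3"
                time="2016-10-14T10:48:39.109-07:00">
    <QueryResponse startPosition="1" maxResults="79" totalCount="79">
        <Bill domain="QBO" sparse="false">
            <Id>=1</Id>
        </Bill>
    </QueryResponse>
</IntuitResponse>"""

# The prefix "i" is arbitrary; only the URI has to match the document.
NS = {"i": "http://schema.intuit.com/finance/v3"}

root = ET.fromstring(XML)  # root is the IntuitResponse element

# Unprefixed names match only elements in no namespace, so nothing is found.
unqualified = root.findall("QueryResponse/Bill/Id")

# Prefixed names are resolved through NS and match the namespaced elements.
qualified = root.findall("i:QueryResponse/i:Bill/i:Id", NS)

print(len(unqualified))             # 0
print([e.text for e in qualified])  # ['=1']
```

The prefix used in the path does not have to match any prefix in the document; the match is made on the namespace URI.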

Defeating namespaces in XPath (not recommended)

An alternative is to write predicates that test against local-name():

/*[local-name()='IntuitResponse']/*[local-name()='QueryResponse']

Or, in XPath 2.0:

/*:IntuitResponse/*:QueryResponse

Skirting namespaces in this manner works but is not recommended because it

  • Under-specifies the full element/attribute name.

  • Fails to differentiate between element/attribute names in different namespaces (the very purpose of namespaces). Note that this concern could be addressed by adding an additional predicate to check the namespace URI explicitly:

     /*[    namespace-uri()='http://schema.intuit.com/finance/v3' 
        and local-name()='IntuitResponse']
     /*[    namespace-uri()='http://schema.intuit.com/finance/v3' 
        and local-name()='QueryResponse']
    

    Thanks to Daniel Haley for the namespace-uri() note.

  • Is excessively verbose.
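
To illustrate the second point, here is a small sketch using Python's standard-library ElementTree. The document and its second namespace (urn:example:other) are invented for illustration. ElementTree does not implement local-name(), so its own {*} wildcard (Python 3.8+) stands in for it here; both ignore the namespace part of the name:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: urn:example:other is an invented second namespace.
XML = """<IntuitResponse xmlns="http://schema.intuit.com/finance/v3"
                xmlns:o="urn:example:other">
    <Id>intuit</Id>
    <o:Id>other</o:Id>
</IntuitResponse>"""

root = ET.fromstring(XML)

# {*}Id ignores the namespace, much as local-name()='Id' does in XPath,
# so it cannot tell the two Id elements apart.
any_ns = [e.text for e in root.findall("{*}Id")]

# A namespace-qualified name selects only the Intuit element.
NS = {"i": "http://schema.intuit.com/finance/v3"}
intuit_only = [e.text for e in root.findall("i:Id", NS)]

print(any_ns)       # ['intuit', 'other']
print(intuit_only)  # ['intuit']
```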

XPath 3.0/3.1

Libraries and tools that support modern XPath 3.0/3.1 allow the specification of a namespace URI directly in an XPath expression:

/Q{http://schema.intuit.com/finance/v3}IntuitResponse/Q{http://schema.intuit.com/finance/v3}QueryResponse

While Q{http://schema.intuit.com/finance/v3} is much more verbose than using an XML namespace prefix, it has the advantage of being independent of the namespace prefix binding mechanism of the hosting library. The Q{} notation is derived from Clark Notation ({uri}local), named after its originator, James Clark. The W3C XPath 3.1 EBNF grammar calls the Q{...} part a BracedURILiteral and the complete name a URIQualifiedName.

Thanks to Michael Kay for the suggestion to cover XPath 3.0/3.1's BracedURILiteral.
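
Python's standard-library ElementTree is not an XPath 3.x processor and does not accept Q{...}, but it does accept the closely related {uri}local form (no leading Q) directly in tag names and paths. The sketch below, on a cut-down version of the OP's document, shows the same idea of embedding the namespace URI in the expression instead of binding a prefix; the variable names are arbitrary:

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schema.intuit.com/finance/v3"
XML = f'<IntuitResponse xmlns="{NS_URI}"><QueryResponse totalCount="79"/></IntuitResponse>'

root = ET.fromstring(XML)

# ElementTree stores element names as {uri}local.
print(root.tag)  # {http://schema.intuit.com/finance/v3}IntuitResponse

# The same form works in a path, so no prefix mapping is needed.
hit = root.find(f"{{{NS_URI}}}QueryResponse")
print(hit.get("totalCount"))  # 79
```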

kjhughes
  • Thank you for such a complete answer. One thing I still don't understand though is how when I use a library like Javax or Pugi XML to parse the XML above with the path I specified, I actually do get results (i.e. a node list) back. Do some of these libraries have an ability to somehow infer simple namespaces? – Adam Nov 25 '16 at 17:42
  • I think what it's doing is just ignoring namespaces entirely. If I use `/IntuitResponse/QueryResponse/Bill/Id` without registering a namespace, pugi/javax retrieve *all* `Id`s in the document. – Adam Nov 25 '16 at 19:02
  • Is this the same pugixml that states in its 1.8 manual, [8.6. Conformance to W3C specification](http://pugixml.org/docs/manual.html#xpath.w3c), that it *does not provide a fully conformant XPath 1.0 implementation*? If so, that would explain the behavior you're seeing, and, if so, I would recommend avoiding it and all other non-conformant processors. – kjhughes Nov 25 '16 at 19:35
  • I couldn't see anything in that section that fully explains why it would retrieve all `Id`s when namespace isn't specified. Am I missing something, or should I just chalk the behaviour up to "pugi is not fully conformant to the W3C specs and so does weird stuff sometimes". But that doesn't explain why when using the Javax library for XPath I get the same behaviour. – Adam Nov 25 '16 at 19:44
  • pugi: nonconformance declaration in docs + odd behavior observation = turn and run / life's too short. Javax: Don't forget to call [**`setNamespaceAware(true)`**](http://docs.oracle.com/javase/6/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setNamespaceAware%28boolean%29) on the `DocumentBuilderFactory`. – kjhughes Nov 25 '16 at 19:56
  • Turns out pugixml doesn't support xml namespaces at all (http://stackoverflow.com/questions/1042855/using-boost-to-read-and-write-xml-files). Turning and running. – Adam Nov 25 '16 at 21:10
  • Thanks, very helpful. – Doug Glancy Nov 20 '17 at 04:18
  • Gem of an answer. – undetected Selenium Aug 24 '20 at 10:14
  • damn fine answer sir. i would add, if you want to use xpath axis (at least with `lxml`) you can just insert it before the element name like `root.xpath("child::foo:element", namespaces={"foo": "bar"})` this is not documented in the `lxml` docs. – Edward Feb 03 '21 at 12:03
  • It might be worth adding that in XPath 3.0/3.1, instead of using `foo:element` and defining a binding for `foo` in the external API, you can write `Q{http://example.com/ns}element` which avoids use of namespace prefixes entirely. This is most useful when the extra verbosity doesn't matter, for example if the XPath code is software-generated. – Michael Kay Dec 09 '22 at 16:15
  • @MichaelKay: Good idea! Answer updated to describe XPath 3.0/3.1's new `BracedURILiteral` syntax. Thank you. – kjhughes Dec 09 '22 at 16:57
-1

I use /*[name()='...'] in a google sheet to fetch some counts from Wikidata. I have a table like this

 thes   WD prop   links   items
 NOM    P7749     3925    3789
 AAT    P1014     21157   20224
and the formulas in cols links and items are

=IMPORTXML("https://query.wikidata.org/sparql?query=SELECT(COUNT(*)as?c){?item wdt:"&$B14&"[]}","//*[name()='literal']")
=IMPORTXML("https://query.wikidata.org/sparql?query=SELECT(COUNT(distinct?item)as?c){?item wdt:"&$B14&"[]}","//*[name()='literal']")

respectively. The SPARQL query happens not to have any spaces...

I saw name() used instead of local-name() in Xml Namespace breaking my xpath!, and for some reason //*:literal doesn't work.

Vladimir Alexiev