
Since I have now run into this annoying issue for the second time, I thought that asking would help.

Sometimes I have to get Elements from XML documents, but the ways to do this are awkward.

I’d like to know of a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the document’s namespaces as prefixes automatically, or a hidden option in the builtin XML implementations or in lxml to strip namespaces completely. Clarification follows, unless you already know what I want :)

Example-doc:

<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>

What I can do

The ElementTree API is the only builtin one (I know of) providing XPath queries. But it requires me to use “UNames.” This looks like so: /{http://really-long-namespace.uri}root/{http://with-ambivalent.end/#}elem

As you can see, these are quite verbose. I can shorten them by doing the following:

default_ns = "http://really-long-namespace.uri"
other_ns   = "http://with-ambivalent.end/#"
doc.find("/{{{0}}}root/{{{1}}}elem".format(default_ns, other_ns))

But this is both {{{ugly}}} and fragile, since http…end/# ≠ http…end# ≠ http…end/ ≠ http…end, and who am I to know which variant will be used?

Also, lxml supports namespace prefixes, but it neither uses the ones declared in the document nor provides an automated way to deal with default namespaces. I would still have to fetch one element of each namespace to retrieve that namespace from the document. Namespace attributes are not preserved, so there is no way to retrieve the namespaces automatically from those either.

There is a namespace-agnostic way of XPath queries, too, but it is both verbose/ugly and unavailable in the builtin implementation: /*[local-name() = 'root']/*[local-name() = 'elem']
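For comparison, newer versions of the builtin ElementTree do accept a namespace wildcard in tag names. The following is only a minimal sketch, assuming Python 3.8 or later (where the `{*}tag` form was added); the variable names are illustrative, and the paths are relative to the root element because ElementTree does not accept absolute paths on elements.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

root = ET.fromstring(DOC)

# "{*}elem" matches a child named "elem" in any namespace (or in none).
# Assumption: the interpreter is Python 3.8+, where this wildcard exists.
elem = root.find("./{*}elem")
print(elem.tag)

# The same wildcard also works for descendant searches.
matches = root.findall(".//{*}elem")
print(len(matches))
```

This stays far short of full XPath, but it covers the simple "ignore the namespace" case without `local-name()`.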

What I want to do

I want to find a library, option, or generic XPath-morphing function that achieves the above examples by typing little more than the following…

  1. Unnamespaced: /root/elem
  2. Namespace-prefixes from document: /root/other:elem

…plus maybe some statement that I indeed want to use the document’s prefixes or strip the namespaces.

Further clarification: although my current use case is as simple as that, I will have to use more complex ones in the future.

Thanks for reading!


Solved

The user samplebias directed my attention to py-dom-xpath, which is exactly what I was looking for. My actual code now looks like this:

import xml.dom.minidom
import xpath  # py-dom-xpath

# parse the document into a DOM tree
rdf_tree = xml.dom.minidom.parse("install.rdf")
# read the default namespace and prefixes from the root node
context = xpath.XPathContext(rdf_tree)

name    = context.findvalue("//em:id", rdf_tree)
version = context.findvalue("//em:version", rdf_tree)

# <Description/> inherits the default RDF namespace
resource_nodes = context.find("//Description/following-sibling::*", rdf_tree)

Consistent with the document, simple, namespace-aware; perfect.

flying sheep
  • I think you should read http://stackoverflow.com/questions/8692/how-to-use-xpath-in-python –  Apr 06 '11 at 22:46
  • and i think you should read my question. – flying sheep Apr 07 '11 at 00:01
  • You are right. I read it closely, and it seems you want to define another language, one that is no longer XPath, which translates any name test that is just an NCName test (i.e. `/root`) into a local name test (i.e. `/*[local-name()='root']`) and any QName test (i.e. `/other:elem`) into a source name test (i.e. `/*[local-name()='elem'][name(namespace::*[.=namespace-uri(..)])='other']`). **But again: that would not be XPath**. –  Apr 07 '11 at 02:40
  • After some research, i found out that it wouldn’t be XPath 1.0 but rather XPath 2.0. See [this](http://www.w3.org/TR/xpath20/#dt-static-namespaces) and the following definitions. – flying sheep Apr 07 '11 at 10:20
  • Neither XPath 1.0 nor XPath 2.0. Using a default namespace for unqualified names in an XPath 2.0 expression does not mean that you don't need to declare that namespace URI (including the null namespace URI, which is the default value): `/root` in XPath 2.0 means _"the root element `root` under the default namespace **in my evaluation context**"_, not in the source document. –  Apr 07 '11 at 12:48

2 Answers


The *[local-name() = "elem"] syntax should work, but to make it easier you can create a function to simplify construction of the partial or full "wildcard namespace" XPath expressions.

I'm using python-lxml 2.2.4 on Ubuntu 10.04 and the script below works for me. You'll need to customize the behavior depending on how you want to specify the default namespaces for each element, plus handle any other XPath syntax you want to fold into the expression:

import lxml.etree

def xpath_ns(tree, expr):
    "Parse a simple expression and prepend namespace wildcards where unspecified."
    qual = lambda n: n if not n or ':' in n else '*[local-name() = "%s"]' % n
    expr = '/'.join(qual(n) for n in expr.split('/'))
    nsmap = dict((k, v) for k, v in tree.nsmap.items() if k)
    return tree.xpath(expr, namespaces=nsmap)

doc = '''<root xmlns="http://really-long-namespace.uri"
    xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>'''

tree = lxml.etree.fromstring(doc)
print(xpath_ns(tree, '/root'))
print(xpath_ns(tree, '/root/elem'))
print(xpath_ns(tree, '/root/other:elem'))

Output:

[<Element {http://really-long-namespace.uri}root at 23099f0>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]
[<Element {http://with-ambivalent.end/#}elem at 2309a48>]

Update: If you find out you do need to parse XPaths, you can check out projects like py-dom-xpath, which is a pure Python implementation of (most of) XPath 1.0. At the very least, that will give you some idea of the complexity of parsing XPath.
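If the namespaces carry no information you need, the other option the question mentions is stripping them entirely. Here is a minimal sketch using only the standard library; the helper name `strip_namespaces` is hypothetical, and it only rewrites element tags (attribute names are left untouched). Note that this is lossy: two elements with the same local name in different namespaces become indistinguishable.

```python
import xml.etree.ElementTree as ET

DOC = """<root xmlns="http://really-long-namespace.uri"
    xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

def strip_namespaces(root):
    """Drop the '{uri}' part from every element tag, in place (hypothetical helper)."""
    for elem in root.iter():
        tag = elem.tag
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(tag, str) and tag.startswith("{"):
            elem.tag = tag.split("}", 1)[1]
    return root

root = strip_namespaces(ET.fromstring(DOC))
found = root.find("elem")  # plain, unnamespaced path now works
print(root.tag, found.tag)
```

After stripping, the path `elem` relative to `root` plays the role of `/root/elem` from the question.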

samplebias
  • thanks, that’s really useful for short expression like mine. can you tell me if (and if yes: when) it starts to break? also, i’d like to wait for a more generic solution. should i accept your answer (since it solved my use case) or shouldn’t i (since it does not answer the question generically, i.e. for any XPath)? – flying sheep Apr 07 '11 at 00:00
  • Creating a parser to cover the entire XPath syntax will not be easy, so if you have a small set of simple queries (like in the above example) you'd be better off keeping things simple. If you find out you do need to parse XPaths, you can check out projects like [py-dom-xpath](http://code.google.com/p/py-dom-xpath/) which is a pure Python implementation of (most of) XPath 1.0. In the least that will give you some idea of the complexity of parsing XPath. – samplebias Apr 07 '11 at 01:50
  • py-dom-xpath [does](https://py-dom-xpath.googlecode.com/svn/trunk/doc/index.html) have namespace wildcards and default namespaces builtin, so this is my answer. thanks! (edit it into your answer and i’ll accept it) – flying sheep Apr 07 '11 at 10:14

First, about "what you want to do":

  1. Unnamespaced: /root/elem -> no problem here I presume
  2. Namespace-prefixes from document: /root/other:elem -> well, that's a bit of a problem, you cannot just use "namespace-prefixes from document". Even within one document:
    • namespaced elements do not necessarily even have a prefix
    • the same prefix isn't necessarily always mapped to the same namespace uri
    • the same namespace uri doesn't necessarily always have the same prefix

FYI: if you want to get to the prefix mappings in scope for a certain element, try elem.nsmap in lxml. Also, the iterparse and iterwalk methods in lxml.etree can be used to be "notified" of namespace declarations.
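To illustrate the "notification" idea, here is a rough sketch that collects the prefix declarations while parsing. It uses the standard library's `xml.etree.ElementTree.iterparse`, which supports the same `start-ns` event, so that the example has no third-party dependency; the function name `parse_with_prefixes` is hypothetical. Because of the caveats listed above, it simply keeps the first URI seen for each prefix, which is only safe for documents that use their prefixes consistently.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"""<root xmlns="http://really-long-namespace.uri"
  xmlns:other="http://with-ambivalent.end/#">
    <other:elem/>
</root>"""

def parse_with_prefixes(source):
    """Parse `source`; return (root, {prefix: uri}) as declared in the document."""
    nsmap = {}
    root = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload  # the default namespace has prefix ""
            # Assumption: the first declaration of a prefix wins.
            nsmap.setdefault(prefix, uri)
        elif root is None:
            root = payload  # the first "start" event is the root element
    return root, nsmap

root, nsmap = parse_with_prefixes(io.BytesIO(DOC))
elem = root.find("other:elem", namespaces=nsmap)
print(sorted(nsmap), elem.tag)
```

The collected mapping can then be passed as `namespaces` to `find`/`findall`, so prefixed paths such as `other:elem` follow the document's own declarations.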

Steven
  • I can at least use namespace-prefixes from the document *element*, or use a namespace wildcard; Both with the same library. Check the solution I edited into my question. – flying sheep Apr 08 '11 at 14:19
  • @flying sheep: don't do that unless you are *really* sure that the documents you are processing are consistent in their use of namespace prefixes. (Referring to your example above: only if you are sure all your documents have the same default namespace, and they all define the "em" prefix and map it to the same URI.) – Steven Apr 08 '11 at 20:34
  • i am. i’m currently working with mozilla rdf files. They are XML-serialized RDF files with the additional “em” relations: http://www.mozilla.org/2004/em-rdf# – flying sheep Apr 10 '11 at 11:46
  • the FYI part was really interesting, though. and the namespace wildcards are “safe”, if i don’t know which prefixes are used. – flying sheep Apr 10 '11 at 12:47