Okay, this is starting to drive me a little bit nuts. I've tried several xml/xpath libraries for Python, and can't figure out a simple way to get a stinkin' "title" element.

The latest attempt looks like this (using Amara):

def view(req, url):
    req.content_type = 'text/plain'
    doc = amara.parse(urlopen(url))
    # Write every node matched by the XPath expression, one per line.
    for node in doc.xml_xpath('//title'):
        req.write(str(node)+'\n')

But that prints out nothing. My XML looks like this: http://programanddesign.com/feed/atom/

If I try //* instead of //title it returns everything as expected. I know that the XML has titles in there, so what's the problem? Is it the namespace or something? If so, how can I fix it?


Can't seem to get it working with no prefix, but this does work:

def view(req, url):
    req.content_type = 'text/plain'
    doc = amara.parse(url, prefixes={'atom': 'http://www.w3.org/2005/Atom'})
    req.write(str(doc.xml_xpath('//atom:title')))
mpen
  • You can get rid of the namespace by asking whoever generates the XML to get rid of it. Otherwise you need to deal with it. This can be done by editing the file, but then again, you can handle the whole file with regexps, or simple "finds" in Python as well... But the robust way of handling XML is with an XML parser. Including namespaces. – Lennart Regebro Oct 18 '09 at 07:26
  • On a side note, this question already ranks on page 1 on google for the query "amara get root node"... in under an hour, sheesh – mpen Oct 18 '09 at 08:09

2 Answers


You probably just have to take into account the namespace of the document which you're dealing with.

I'd suggest looking up how to deal with namespaces in Amara:

http://www.xml3k.org/Amara/Manual#namespaces

Edit: Using your code snippet I made some edits. I don't know what version of Amara you're using but based on the docs I tried to accommodate it as much as possible:

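def view(req, url):
    req.content_type = 'text/plain'
    # Map prefixes of your choosing to the namespace URIs used in the feed.
    ns = {u'f': u'http://www.w3.org/2005/Atom',
          u't': u'http://purl.org/syndication/thread/1.0'}
    doc = amara.parse(urlopen(url), prefixes=ns)
    # '//' searches the whole document, as in the question.
    req.write(str(doc.xml_xpath(u'//f:title')))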
meder omuraliev
  • That doesn't really help me. What if I don't know the namespaces beforehand? What if I don't really care what the namespace is? – mpen Oct 18 '09 at 06:56
  • You said your xml document is similar to the one you linked to. The one you linked to contains a namespace. There's a *reason* namespaces are used - of course you can get rid of your namespace from your xml document, then you don't have to worry about it. Otherwise you must account for it. – meder omuraliev Oct 18 '09 at 07:00
  • @"care what the namespace is" - you could probably parse the xmlns attribute and just register that value. – meder omuraliev Oct 18 '09 at 07:05
  • Okay... supposing I *did* know what the namespaces were....setting them to "none" still doesn't work. – mpen Oct 18 '09 at 07:31
  • I have no idea. I just snagged a fresh copy off the ubuntu repositories, so it should be pretty recent/up to date. Got it working now... I guess I can't have no prefix, which I still think is dumb, but I guess I can work with it. – mpen Oct 18 '09 at 07:52
  • Similar yes. I edited the question again, it's more or less the same. – mpen Oct 18 '09 at 07:55
  • "setting them to "none" still doesn't work." No, you need to filter them out of the XML source if you want to get rid of them. – Lennart Regebro Oct 18 '09 at 07:58
  • Right, you can't avoid dealing with namespaces unless you alter the original xml source. The whole point of namespaces I believe is to avoid possible conflicts with duplicate node names/attributes. It's in the XPath and XML standards, which I would obey if I were using those technologies. – meder omuraliev Oct 18 '09 at 08:01
  • When writing XML you can define a namespace for the whole document at the top. So why can't you specify a default namespace to use when querying XML? I'm 98% sure other libraries let you do this... – mpen Oct 18 '09 at 08:11
  • @Mark: In XPath, there is no such thing as a default namespace. I guess it would be possible to fake it, though, and maybe some libraries do that. – Lennart Regebro Oct 18 '09 at 08:27
  • In .NET's XML library, you can create an `XmlNamespaceManager` that maps the empty string to one namespace and a prefix to the empty namespace. I don't know why `lxml` doesn't support this. – Robert Rossney Oct 18 '09 at 17:35
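
For the "what if I don't care what the namespace is?" case raised in the comments above, one option is to compare local names only. The following is a minimal sketch, not Amara or lxml code: it assumes the standard library's `xml.etree.ElementTree`, an inline stand-in for the feed, and hypothetical helper names (`local_name`, `find_by_local_name`). The `http://example.com/other` namespace is made up to show the trade-off.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the Atom feed; the "x" namespace is hypothetical.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:x="http://example.com/other">
  <title>Program and Design</title>
  <entry>
    <title>First post</title>
    <x:title>Unrelated title</x:title>
  </entry>
</feed>
"""


def local_name(tag):
    """Strip a leading '{namespace-uri}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_by_local_name(root, name):
    """Return every element whose local name matches, in any namespace."""
    return [
        el for el in root.iter()
        if isinstance(el.tag, str) and local_name(el.tag) == name
    ]


root = ET.fromstring(SAMPLE_FEED)
all_titles = [el.text for el in find_by_local_name(root, "title")]
print(all_titles)
```

Note that the `x:title` element is matched too. That is the conflict namespaces exist to prevent, so this shortcut is only reasonable when you know the document cannot contain a same-named element from another vocabulary.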

It is indeed the namespaces. It was a bit tricky to find in the lxml docs, but here's how you do it:

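from lxml import etree
# Parse the saved feed; lxml accepts a filename directly.
doc = etree.parse('index.html')
# Bind a prefix of your choosing to the Atom namespace URI for this query.
doc.xpath('//default:title', namespaces={'default': 'http://www.w3.org/2005/Atom'})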

You can also do this:

title_finder = etree.ETXPath('//{http://www.w3.org/2005/Atom}title')
title_finder(doc)

And you'll get the titles back in both cases.
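
If you want to try the same two ideas without installing anything, here is a rough equivalent using the standard library's `xml.etree.ElementTree` instead of lxml. It is a sketch under assumptions: `SAMPLE_FEED` is an inline stand-in for the real feed, and the `atom` prefix is an arbitrary choice, since only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Inline stand-in for the Atom feed from the question.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Program and Design</title>
  <entry>
    <title>First post</title>
  </entry>
</feed>
"""

root = ET.fromstring(SAMPLE_FEED)

# 1) Bind a prefix of your choosing to the namespace URI.
by_prefix = root.findall(".//atom:title", namespaces={"atom": ATOM_NS})

# 2) Spell the namespace out in {uri}name form, with no prefix at all.
by_clark = root.findall(".//{%s}title" % ATOM_NS)

# An unqualified name only matches elements in no namespace,
# which is why the original '//title' query came back empty.
unqualified = root.findall(".//title")

titles = [el.text for el in by_prefix]
print(titles)
```

Both qualified queries return the same elements; the unqualified one returns an empty list.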

Lennart Regebro
  • What if I don't know the namespaces beforehand? I just want to get rid of em. They might even be defined half-way through the document (on a div or something). – mpen Oct 18 '09 at 07:10
  • why can't you just parse the xmlns attribute? – meder omuraliev Oct 18 '09 at 07:12
  • XML is a fully generic data exchange protocol. If you don't know the format you typically can't do much that is useful with the data, as you don't know what the data means. Also if you don't know the structure beforehand, then you MUST take care of and parse namespaces wherever they appear. That however is a generalized XML parsing problem, and highly unlikely to be the case. So I think you do know quite a bit of the structure, including either what the namespaces are, or where they are likely to be defined. So: Not a problem. – Lennart Regebro Oct 18 '09 at 07:17
  • Knowing the structure may include knowledge that the namespace is always the same and can always be safely ignored. In that case, you can filter it out from the document first. But then again, in that case you know there is a namespace there, and you do care, since you filter it out. And then you might as well just include it in the xpath methods. – Lennart Regebro Oct 18 '09 at 07:19
  • If you are using `lxml` then you can get the namespaces with `doc.getroot().nsmap` and then pass that to your `.xpath()` call. Except that it won't help for the default namespace which you will still have to add manually. – ccpizza May 18 '18 at 20:04
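
When the namespaces are not known beforehand, they can be collected from the document itself while parsing, which is roughly the "parse the xmlns attribute and register that value" suggestion. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` rather than lxml's `nsmap`. The helper name `parse_with_namespaces` and the `default` prefix are assumptions made for illustration. The default namespace has no prefix of its own, so it is registered under a made-up one.

```python
import io
import xml.etree.ElementTree as ET

# Inline stand-in for the Atom feed, with one default and one prefixed namespace.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <title>Program and Design</title>
  <entry>
    <title>First post</title>
    <thr:total>3</thr:total>
  </entry>
</feed>
"""


def parse_with_namespaces(xml_text, default_prefix="default"):
    """Parse xml_text and return (root, {prefix: uri}).

    Every namespace declaration in the document is recorded. The default
    namespace is stored under default_prefix, a name chosen here only so
    that it can be used in path expressions.
    """
    namespaces = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload
            # Keep the first declaration seen for each prefix.
            namespaces.setdefault(prefix or default_prefix, uri)
        elif root is None:
            root = payload  # the first "start" event is the root element
    return root, namespaces


root, namespaces = parse_with_namespaces(SAMPLE_FEED)
titles = [el.text for el in root.findall(".//default:title", namespaces)]
totals = [el.text for el in root.findall(".//thr:total", namespaces)]
print(namespaces)
print(titles, totals)
```

If the same prefix is redeclared with a different URI deeper in the document, this sketch keeps only the first binding, so it is best suited to feeds that declare everything on the root element.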