
E.g. consider parsing a pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

    <parent>
        <groupId>com.parent</groupId>
        <artifactId>parent</artifactId>
        <version>1.0-SNAPSHOT</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <modelVersion>2.0.0</modelVersion>
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>Some Module</name>
    ...

Code:

import xml.etree.ElementTree as ET

pom = "pom.xml"  # path to the POM file shown above
tree = ET.parse(pom)
root = tree.getroot()

groupId = root.find("groupId")
artifactId = root.find("artifactId")

Both groupId and artifactId are None. Why, when they are direct children of the root? I also tried calling find on the tree instead of the root (groupId = tree.find("groupId")), but that didn't change anything.

amphibient
  • possible duplicate of [Parsing XML with namespace in Python ElementTree](http://stackoverflow.com/questions/14853243/parsing-xml-with-namespace-in-python-elementtree) – Martijn Pieters Jan 15 '14 at 19:38

2 Answers


The problem is that you don't have a child named groupId; you have a child named {http://maven.apache.org/POM/4.0.0}groupId. etree doesn't ignore XML namespaces; it identifies every element by its "universal name", which is the namespace URI in braces followed by the local tag name. See Working with Namespaces and Qualified Names in the effbot docs.
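
As a minimal sketch of what that means in practice (not taken from the linked docs), the lookups below use a trimmed copy of the question's POM, inlined as a string so the example is self-contained. The names POM_XML, POM_NS and the prefix "m" are illustrative choices, not anything ElementTree requires.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the POM from the question, inlined for the example.
POM_XML = """\
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <groupId>com.parent.somemodule</groupId>
    <artifactId>some_module</artifactId>
</project>
"""

root = ET.fromstring(POM_XML)

# Option 1: spell out the universal name, "{namespace-uri}local-name".
POM_NS = "http://maven.apache.org/POM/4.0.0"
group_id = root.find("{%s}groupId" % POM_NS)

# Option 2: pass a prefix-to-URI mapping; the prefix "m" is arbitrary.
namespaces = {"m": POM_NS}
artifact_id = root.find("m:artifactId", namespaces)

print(group_id.text)
print(artifact_id.text)

# The unqualified name from the question still matches nothing.
print(root.find("groupId"))
```

Either form should work for the other POM elements as well. On Python 3.8 and later, find also accepts a wildcard namespace such as "{*}groupId", which is the closest ElementTree comes to ignoring the namespace.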

abarnert
  • can i somehow make it ignore the namespace? – amphibient Jan 15 '14 at 19:31
  • @amphibient: Not directly, no. If you read the doc page I linked, it shows you the various ways of dealing with this correctly. – abarnert Jan 15 '14 at 19:32
  • @amphibient: It's not retarded; XML that uses namespaces to resolve ambiguity problems would be broken if you ignored them. (XML as a whole is kind of retarded, but that's a different story…) For quick&dirty scripts, you want a quick&dirty parser like `BeautifulSoup`, not a parser that tries to be correct. – abarnert Jan 15 '14 at 19:35
  • @amphibient: Anyway, I could give you code to solve your problem, but if you don't actually understand namespaces and universal names, that code won't do you any good, so you pretty much have to read that document. If you have any questions afterward, I can help. – abarnert Jan 15 '14 at 19:35
  • what i consider "retarded" is the inability to disregard the namespace and use it as though the root were simply `<project>` and not `<project xmlns="http://maven.apache.org/POM/4.0.0">`. why wouldn't there be a feature to ignore it for simpler processing? – amphibient Jan 15 '14 at 19:38
  • @amphibient: Because that would be incorrect as often as it would be useful. It's like saying Python is retarded for not letting you write `'answer: ' + 42`. Sure, that would sometimes be useful, but it would also be an attractive nuisance (as languages like PHP and Tcl prove). – abarnert Jan 15 '14 at 19:50

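Just to expand on abarnert's comment about BeautifulSoup: if you DO just want a quick and dirty solution to the problem, this is probably the fastest way to go about it. I used this in a personal script. Note that getElementsByTagNameNS is a DOM method (available, for example, on documents parsed with the standard library's xml.dom.minidom) rather than part of bs4's own API. With a DOM document, you can look elements up with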

element = dom.getElementsByTagNameNS('*','elementname')

This matches elements with that local name in ANY namespace, which is handy if you know the file uses only one namespace, so there's no ambiguity.
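
A hedged sketch of that lookup, assuming dom is a document parsed with the standard library's xml.dom.minidom (the answer does not show how dom was created). The inlined POM_XML is a trimmed copy of the question's file and is only there to keep the example self-contained.

```python
from xml.dom import minidom

# Trimmed copy of the POM from the question, inlined for the example.
POM_XML = """\
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <parent>
        <groupId>com.parent</groupId>
    </parent>
    <groupId>com.parent.somemodule</groupId>
</project>
"""

# Assumption: "dom" is a minidom Document.
dom = minidom.parseString(POM_XML)

# "*" as the namespace argument matches any namespace URI.
matches = dom.getElementsByTagNameNS("*", "groupId")

# The search covers the whole document, so the groupId nested under
# <parent> is returned as well as the project's own groupId.
values = [element.firstChild.data for element in matches]
print(values)
```

Because the search is document-wide rather than limited to direct children of the root, you would still need to filter the results (for example by checking element.parentNode) to pick out the project's own groupId.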

Sean K.