Parsing XML with namespace in Python via 'ElementTree'

Question

I have the following XML which I want to parse using Python's ElementTree:

<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">

    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
        <rdfs:comment xml:lang="en">
          a group of sports teams that compete against each other
          in Basketball
        </rdfs:comment>
    </owl:Class>

</rdf:RDF>

I want to find all owl:Class tags and then extract the value of all rdfs:label instances inside them. I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse("filename")
root = tree.getroot()
root.findall('owl:Class')

Because of the namespace, I am getting the following error.

SyntaxError: prefix 'owl' not found in prefix map

I tried reading the document at http://effbot.org/zone/element-namespaces.htm but I am still not able to get this working since the above XML has multiple nested namespaces.

Kindly let me know how to change the code to find all the owl:Class tags.

Since Python 3.8, a namespace wildcard can be used with `find()`, `findall()` and `findtext()`. See https://stackoverflow.com/a/62117710/407651. — mzjn, Jul 19 '21 at 19:50
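
A minimal sketch of the wildcard form mentioned in the comment above, assuming Python 3.8 or later. The inline XML is a trimmed copy of the document from the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (illustrative only).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

root = ET.fromstring(XML)

# '{*}' matches the given local name in any namespace (Python 3.8+ only).
classes = root.findall("{*}Class")
labels = [label.text for label in root.findall("{*}Class/{*}label")]
print(labels)  # ['basketball league']
```

The wildcard ignores which namespace an element belongs to, so it can also match same-named elements from unrelated vocabularies.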

Martijn Pieters · Accepted Answer · 2021-07-19T06:54:34.613

261

You need to give the .find(), findall() and iterfind() methods an explicit namespace dictionary:

namespaces = {'owl': 'http://www.w3.org/2002/07/owl#'} # add more as needed

root.findall('owl:Class', namespaces)

Prefixes are only looked up in the namespaces parameter you pass in. This means you can use any namespace prefix you like; the API splits off the owl: part, looks up the corresponding namespace URL in the namespaces dictionary, then changes the search to look for the XPath expression {http://www.w3.org/2002/07/owl#}Class instead. You can use the same syntax yourself too, of course:

root.findall('{http://www.w3.org/2002/07/owl#}Class')

Also see the Parsing XML with Namespaces section of the ElementTree documentation.
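
Putting the pieces together for the original question, here is a minimal sketch that finds every owl:Class and collects the text of the rdfs:label elements inside it. It parses a trimmed inline copy of the question's XML with `ET.fromstring` instead of reading a file, and the names `XML`, `NAMESPACES` and `labels` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (illustrative only).
XML = """<rdf:RDF xml:base="http://dbpedia.org/ontology/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns="http://dbpedia.org/ontology/">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

# The prefixes are local to this dictionary; only the URIs must match the document.
NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

root = ET.fromstring(XML)

labels = []
for owl_class in root.findall("owl:Class", NAMESPACES):
    for label in owl_class.findall("rdfs:label", NAMESPACES):
        labels.append(label.text)

print(labels)  # ['basketball league']
```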

If you can switch to the lxml library, things are better; that library supports the same ElementTree API, but collects namespaces for you in the .nsmap attribute on elements and generally has superior namespace support.

edited Jul 19 '21 at 06:54

answered Feb 13 '13 at 12:18

Martijn Pieters


Thanks. Especially for the second part, where you can give the namespace directly. – Sudar Feb 13 '13 at 12:35
10

Thank you. Any idea how I can get the namespace directly from the XML, without hard-coding it? Or how can I ignore it? I've tried findall('{*}Class') but it won't work in my case. – Kostanos Nov 27 '13 at 01:26
7

You'd have to scan the tree for `xmlns` attributes yourself; as stated in the answer, `lxml` does this for you, the `xml.etree.ElementTree` module does not. But if you are trying to match a specific (already hardcoded) element, then you are also trying to match a specific element in a specific namespace. That namespace is not going to change between documents any more than the element name is. You may as well hardcode that with the element name. – Martijn Pieters Nov 28 '13 at 15:12
15

@Jon: `register_namespace` only influences serialisation, not search. – Martijn Pieters Aug 20 '14 at 07:10
5

Small addition that may be useful: when using `cElementTree` instead of `ElementTree`, `findall` will not take namespaces as a keyword argument, but rather simply as a normal argument, i.e. use `ctree.findall('owl:Class', namespaces)`. – egpbos Sep 30 '14 at 15:18
1

@egpbos: adjusted to be `cElementTree` compatible. – Martijn Pieters Sep 30 '14 at 15:21
Thanks a lot Martijn, where did you find that findall() takes an extra argument? docs.python.org does not mention it. – Bludwarf Mar 22 '15 at 07:50
2

@Bludwarf: The docs do mention it (now, if not when you wrote that), but you have to read them verrrry carefully. See the [Parsing XML with Namespaces](https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml-with-namespaces) section: there's an example contrasting the use of `findall` without and then with the `namespaces` argument, but the argument is not mentioned as one of the arguments to the method in the [Element object](https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects) section. – Wilson F Jun 18 '16 at 20:02
@MartijnPieters, how do I get the value of the attribute `xml:lang` of the `rdfs:label` element? – Alex Raj Kaliamoorthy May 18 '19 at 01:05
Just a reminder. It took me hours of debugging to find that the second parameter of `findtext()` is not the namespace map. So it needs to be written as `findtext('./prefix:tag', namespaces=prefix_map)` – bjc Feb 24 '21 at 04:04
@bjc more recent Python 3 versions use the Argument Clinic to handle argument parsing for most cElementTree methods and thus find and findall now support `namespaces` as a keyword argument. – Martijn Pieters Feb 24 '21 at 07:37
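
Two points from the comments above can be sketched together: passing the prefix map to `findtext()` by keyword, and reading the `xml:lang` attribute. This assumes a Python 3 version where `namespaces` is accepted as a keyword argument. The inline XML is a trimmed copy of the question's document, and `XML_NS` is just a local name for the fixed URI that the reserved `xml` prefix is bound to.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (illustrative only).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

ns = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
root = ET.fromstring(XML)

# findtext(match, default=None, namespaces=None): the second positional
# parameter is the default value, so the prefix map is passed by keyword.
label_text = root.findtext("owl:Class/rdfs:label", namespaces=ns)

# Attribute names are stored in {uri}local form; the 'xml' prefix is always
# bound to this URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"
label = root.find("owl:Class/rdfs:label", ns)
lang = label.get(f"{{{XML_NS}}}lang")

print(label_text, lang)  # basketball league en
```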

Brad Dre · Answer 2 · 2019-07-30T18:47:24.943

67

Here's how to do this with lxml without having to hard-code the namespaces or scan the text for them (as Martijn Pieters mentions):

from lxml import etree
tree = etree.parse("filename")
root = tree.getroot()
root.findall('owl:Class', root.nsmap)

UPDATE:

5 years later I'm still running into variations of this issue. lxml helps as I showed above, but not in every case. The commenters may have a valid point regarding this technique when it comes to merging documents, but I think most people are having difficulty simply searching documents.

Here's another case and how I handled it:

<?xml version="1.0" ?><Tag1 xmlns="http://www.mynamespace.com/prefix">
<Tag2>content</Tag2></Tag1>

xmlns without a prefix means that unprefixed tags get this default namespace. This means that when you search for Tag2, you need to include the namespace to find it. However, lxml creates an nsmap entry with None as the key, and I couldn't find a way to search for it. So, I created a new namespace dictionary like this:

namespaces = {}
# The response uses a default namespace, and its tags don't mention it,
# so build a new namespace map using a prefix of our choice.
for prefix, uri in root.nsmap.items():  # .iteritems() exists only in Python 2
    if not prefix:
        namespaces['myprefix'] = uri
e = root.find('myprefix:Tag2', namespaces)

edited Jul 30 '19 at 18:47

answered Nov 07 '14 at 18:22

Brad Dre


3

The full namespace URL *is* the namespace identifier you're supposed to hard-code. The local prefix (`owl`) can change from file to file. Therefore doing what this answer suggests is a really bad idea. – Matti Virkkunen Mar 18 '16 at 21:53
1

@MattiVirkkunen exactly if the owl definition can change from file to file, shouldn't we use the definition defined in each files instead of hardcoding it? – Loïc Faure-Lacroix Aug 01 '16 at 03:26
1

@LoïcFaure-Lacroix: Usually XML libraries will let you abstract that part out. You don't need to even know or care about the prefix used in the file itself, you just define your own prefix for the purpose of parsing or just use the full namespace name. – Matti Virkkunen Aug 05 '16 at 01:40
This answer helped me to at least be able to use the find function. No need to create your own prefix. I just did key = list(root.nsmap.keys())[0] and then added the key as prefix: root.find(f'{key}:Tag2', root.nsmap) – Eelco van Vliet Dec 10 '19 at 09:30

Davide Brunato · Answer 3 · 2017-02-21T08:15:53.137

45

Note: This answer is for the ElementTree module in Python's standard library, without using hardcoded namespaces.

To extract the namespace prefixes and URIs from XML data, you can use the ElementTree.iterparse function, parsing only namespace start events (start-ns):

>>> from io import StringIO
>>> from xml.etree import ElementTree
>>> my_schema = u'''<rdf:RDF xml:base="http://dbpedia.org/ontology/"
...     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
...     xmlns:owl="http://www.w3.org/2002/07/owl#"
...     xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
...     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
...     xmlns="http://dbpedia.org/ontology/">
... 
...     <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
...         <rdfs:label xml:lang="en">basketball league</rdfs:label>
...         <rdfs:comment xml:lang="en">
...           a group of sports teams that compete against each other
...           in Basketball
...         </rdfs:comment>
...     </owl:Class>
... 
... </rdf:RDF>'''
>>> my_namespaces = dict([
...     node for _, node in ElementTree.iterparse(
...         StringIO(my_schema), events=['start-ns']
...     )
... ])
>>> from pprint import pprint
>>> pprint(my_namespaces)
{'': 'http://dbpedia.org/ontology/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#'}

Then the dictionary can be passed as an argument to the search functions:

root.findall('owl:Class', my_namespaces)
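
The snippet above does not show where `root` comes from. One possible way to get both the namespace map and the root element in a single pass is sketched below. The helper name `parse_with_namespaces` is hypothetical, not part of ElementTree, and the inline XML is a trimmed copy of the question's document.

```python
from io import StringIO
from xml.etree import ElementTree

def parse_with_namespaces(source):
    """Return (root, namespaces) for a filename or file-like object."""
    namespaces = {}
    root = None
    for event, item in ElementTree.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = item          # item is a (prefix, uri) tuple
            namespaces[prefix] = uri
        elif root is None:
            root = item                 # the first 'start' event is the root element
    # The tree is fully built once iterparse is exhausted.
    return root, namespaces

# Trimmed copy of the question's document (illustrative only).
my_schema = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

root, my_namespaces = parse_with_namespaces(StringIO(my_schema))
classes = root.findall("owl:Class", my_namespaces)
print(len(classes))  # 1
```

If the same prefix is declared more than once in a document, this sketch keeps only the last URI seen for it.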

edited Feb 21 '17 at 08:15

answered May 24 '16 at 09:09

Davide Brunato


2

This is useful for those of us without access to lxml and without wanting to hardcode namespace. – delrocco Jun 06 '16 at 02:41
1

I got the error `ValueError: write to closed` for this line: `filemy_namespaces = dict([node for _, node in ET.iterparse(StringIO(my_schema), events=['start-ns'])])`. Any idea what's wrong? – Yuli Feb 20 '17 at 12:03
Probably the error is related to the class io.StringIO, which refuses ASCII strings. I had tested my recipe with Python 3. After adding the unicode string prefix 'u' to the sample string, it also works with Python 2 (2.7). – Davide Brunato Feb 21 '17 at 08:23
Instead of `dict([...])` you can also use dict comprehension. – Arminius Nov 01 '17 at 21:07
Instead of `StringIO(my_schema)` you can also put the filename of the XML file. – JustAC0der Jun 29 '18 at 18:57
1

This is exactly what I was looking for! Thank you! – tjwrona1992 Jan 15 '21 at 03:53
Where is `root` defined, that calls `findall()`? – Timo Jun 24 '21 at 19:02
No, iterparse() is not related to find/findall/finditer. It uses the XML parser to iterate over tree nodes, including the start and the end (so the scope) of namespace declarations. – Davide Brunato Jun 26 '21 at 07:42

score 7 · Answer 4 · answered Aug 16 '16 at 09:51

I've been using similar code to this and have found it's always worth reading the documentation... as usual!

findall() will only find elements which are direct children of the current tag. So, not really ALL.

It might be worth your while trying to get your code working with the following, especially if you're dealing with big and complex XML files, so that sub-sub-elements (etc.) are also included. If you know yourself where elements are in your XML, then I suppose it'll be fine! Just thought this was worth remembering.

root.iter()

ref: https://docs.python.org/3/library/xml.etree.elementtree.html#finding-interesting-elements "Element.findall() finds only elements with a tag which are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes:"

The ElementTree documentation is a bit unclear and easy to misunderstand, IMHO. It **is** possible to get all descendants. Instead of `elem.findall("X")`, use `elem.findall(".//X")`. — mzjn, Dec 09 '21 at 08:51
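
The difference between a plain tag path, a `.//` path and `iter()` can be seen in a short sketch. The inline XML is a trimmed copy of the question's document, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's document (illustrative only).
XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <owl:Class rdf:about="http://dbpedia.org/ontology/BasketballLeague">
        <rdfs:label xml:lang="en">basketball league</rdfs:label>
    </owl:Class>
</rdf:RDF>"""

ns = {"rdfs": "http://www.w3.org/2000/01/rdf-schema#"}
root = ET.fromstring(XML)

# A plain tag path only looks at direct children of root; the label is a
# grandchild, so nothing is found.
direct = root.findall("rdfs:label", ns)

# './/' searches all descendants.
nested = root.findall(".//rdfs:label", ns)

# iter() also walks the whole subtree, but needs the {uri}local form.
iterated = list(root.iter("{http://www.w3.org/2000/01/rdf-schema#}label"))

print(len(direct), len(nested), len(iterated))  # 0 1 1
```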

Bram Vanroy · Answer 5 · 2019-10-11T08:33:15.753

7

To get the namespace in its namespace format, e.g. {myNameSpace}, you can do the following:

import re

root = tree.getroot()
ns = re.match(r'\{.*\}', root.tag).group(0)  # e.g. '{myNameSpace}'

This way, you can use it later on in your code to find nodes, e.g. using string interpolation (Python 3).

link = root.find(f"{ns}link")
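
A self-contained sketch of the same idea follows, with a guard for documents whose root has no namespace (where `re.match` would return `None`). The Atom-style sample document and its URL are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document with a default namespace.
doc = '<feed xmlns="http://www.w3.org/2005/Atom"><link href="http://example.com/"/></feed>'
root = ET.fromstring(doc)

# root.tag looks like '{http://www.w3.org/2005/Atom}feed'; keep the '{...}' part.
match = re.match(r"\{.*\}", root.tag)
ns = match.group(0) if match else ""   # empty string when the root has no namespace

link = root.find(f"{ns}link")
print(ns, link.get("href"))  # {http://www.w3.org/2005/Atom} http://example.com/
```

This only recovers the namespace of the root element; children that live in other namespaces still need their own URIs.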

edited Oct 11 '19 at 08:33

answered Oct 01 '18 at 12:25

Bram Vanroy


score 3 · Answer 6 · answered Apr 07 '21 at 16:13

This is basically Davide Brunato's answer; however, I found out that his answer had serious problems with the default namespace being the empty string, at least on my Python 3.6 installation. The function I distilled from his code, and that worked for me, is the following:

from io import StringIO
from xml.etree import ElementTree
def get_namespaces(xml_string):
    namespaces = dict([
            node for _, node in ElementTree.iterparse(
                StringIO(xml_string), events=['start-ns']
            )
    ])
    namespaces["ns0"] = namespaces[""]
    return namespaces

where ns0 is just a placeholder for the empty namespace and you can replace it by any random string you like.

If I then do:

my_namespaces = get_namespaces(my_schema)
root.findall('ns0:SomeTagWithDefaultNamespace', my_namespaces)

It produces the correct answer for tags using the default namespace as well.
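
The function above raises `KeyError` when the document declares no default namespace. A slightly more defensive variant is sketched here. The `default_prefix` parameter is an assumption added for the example, and the sample document is borrowed from the lxml answer earlier in this thread.

```python
from io import StringIO
from xml.etree import ElementTree

def get_namespaces(xml_string, default_prefix="ns0"):
    """Collect prefix -> URI pairs, exposing the default namespace as default_prefix."""
    namespaces = dict(
        node for _, node in ElementTree.iterparse(
            StringIO(xml_string), events=["start-ns"]
        )
    )
    # Move the default namespace (empty prefix), if there is one, to a usable prefix.
    if "" in namespaces:
        namespaces[default_prefix] = namespaces.pop("")
    return namespaces

doc = '<Tag1 xmlns="http://www.mynamespace.com/prefix"><Tag2>content</Tag2></Tag1>'
ns = get_namespaces(doc)
root = ElementTree.fromstring(doc)
tag2 = root.find("ns0:Tag2", ns)
print(tag2.text)  # content

# A document without any namespace declarations yields an empty map.
no_default = get_namespaces("<a><b/></a>")
```

If the document already declares a prefix equal to `default_prefix`, this sketch overwrites it, so a different placeholder would be needed in that case.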

score 1 · Answer 7 · edited Jun 04 '19 at 00:49

My solution is based on @Martijn Pieters' comment:

register_namespace only influences serialisation, not search.

So the trick here is to use different dictionaries for serialization and for searching.

namespaces = {
    '': 'http://www.example.com/default-schema',
    'spec': 'http://www.example.com/specialized-schema',
}

Now, register all namespaces for parsing and writing:

for name, value in namespaces.items():  # .iteritems() exists only in Python 2
    ET.register_namespace(name, value)

For searching (find(), findall(), iterfind()) we need a non-empty prefix. Pass these functions a modified dictionary (here I modify the original dictionary, but this must be done only after the namespaces are registered).

namespaces['default'] = namespaces['']

Now, the functions from the find() family can be used with the default prefix:

print(root.find('default:myelem', namespaces))

but

tree.write(destination)

does not use any prefixes for elements in the default namespace.
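
An end-to-end sketch of this split between serialization and searching follows, using `ET.tostring` in place of `tree.write` so the result can be inspected. The sample document and the name `search_ns` are assumptions made for the example. Note that `ET.register_namespace` changes a module-wide registry.

```python
import xml.etree.ElementTree as ET

DEFAULT_URI = "http://www.example.com/default-schema"

# Serialization: register the URI with an empty prefix so that output uses
# xmlns="..." instead of generated ns0: prefixes.
ET.register_namespace("", DEFAULT_URI)

# Hypothetical sample document.
root = ET.fromstring(f'<root xmlns="{DEFAULT_URI}"><myelem>hi</myelem></root>')

# Searching: use a separate map with a non-empty prefix of our choice.
search_ns = {"default": DEFAULT_URI}
elem = root.find("default:myelem", search_ns)

output = ET.tostring(root, encoding="unicode")
print(elem.text)  # hi
print(output)
```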
