178

I want to use the findall method of the ElementTree module to locate some elements in a source XML file.

However, the source XML file (test.xml) uses namespaces. Here is a truncated sample of the file:

<?xml version="1.0" encoding="iso-8859-1"?>
<XML_HEADER xmlns="http://www.test.com">
    <TYPE>Updates</TYPE>
    <DATE>9/26/2012 10:30:34 AM</DATE>
    <COPYRIGHT_NOTICE>All Rights Reserved.</COPYRIGHT_NOTICE>
    <LICENSE>newlicense.htm</LICENSE>
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>

The sample python code is below:

from xml.etree import ElementTree as ET
tree = ET.parse(r"test.xml")
el1 = tree.findall("DEAL_LEVEL/PAID_OFF") # Returns an empty list
el2 = tree.findall("{http://www.test.com}DEAL_LEVEL/{http://www.test.com}PAID_OFF") # Returns [<Element '{http://www.test.com}PAID_OFF' at 0xb78b90>]

Though using "{http://www.test.com}" works, it's very inconvenient to add a namespace in front of each tag.

How can I ignore the namespace when using functions like find, findall, ...?

Benjamin Loison
KevinLeng
  • 1,843
  • 2
  • 12
  • 7
  • 24
    Is `tree.findall("xmlns:DEAL_LEVEL/xmlns:PAID_OFF", namespaces={'xmlns': 'http://www.test.com'})` convenient enough? – iMom0 Nov 16 '12 at 08:57
  • Thanks very much. I tried your method and it works. It's more convenient than mine, but it's still a little awkward. Do you know whether the ElementTree module offers a proper way to solve this, or whether no such method exists at all? – KevinLeng Nov 16 '12 at 09:17
  • 1
    Or try `tree.findall("{0}DEAL_LEVEL/{0}PAID_OFF".format('{http://www.test.com}'))` – Warf Apr 27 '20 at 09:46
  • 3
    In Python 3.8, a wildcard can be used for the namespace. https://stackoverflow.com/a/62117710/407651 – mzjn Jun 26 '20 at 03:27

13 Answers

74

Instead of modifying the XML document itself, it's best to parse it and then modify the tags in the result. This way you can handle multiple namespaces and namespace aliases:

from io import StringIO  # for Python 2 import from StringIO instead
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(StringIO(xml))
for _, el in it:
    _, _, el.tag = el.tag.rpartition('}') # strip ns
root = it.root

This is based on the discussion here.
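As a minimal sketch, here is the same technique applied to a shortened copy of the question's sample XML. The helper name `parse_without_namespaces` and the `SAMPLE_XML` constant are made up for illustration; once the tags are stripped, the plain path from the question matches.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question
SAMPLE_XML = """<XML_HEADER xmlns="http://www.test.com">
    <TYPE>Updates</TYPE>
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""


def parse_without_namespaces(xml_text):
    """Parse xml_text and drop the '{uri}' part of every element tag."""
    it = ET.iterparse(io.StringIO(xml_text))
    for _, el in it:
        # rpartition leaves tags without a namespace unchanged
        _, _, el.tag = el.tag.rpartition('}')
    return it.root


root = parse_without_namespaces(SAMPLE_XML)
paid_off = root.findall("DEAL_LEVEL/PAID_OFF")
print([el.text for el in paid_off])  # ['N']
```

Note that the namespace information is gone from the tree afterwards, so this suits read-only lookups rather than round-tripping the document.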

Neuron
nonagon
  • 5
    This. This this this. Multiple name spaces were going to be the death of me. – Jess Oct 11 '14 at 03:08
  • 17
    OK, this is nice and more advanced, but still it's not `et.findall('{*}sometag')`. And it also is mangling the element tree itself, not just "perform the search ignoring namespaces just this time, without re-parsing the document etc, retaining the namespace information". Well, for that case you observably need to iterate through the tree, and see for yourself, if the node matches your wishes after removing the namespace. – Tomasz Gandor Nov 14 '14 at 15:12
  • 1
    This works by stripping the string, but when I save the XML file using write(...) the namespace declaration xmlns="http://bla" disappears from the beginning of the XML. Please advise – TraceKira Aug 29 '16 at 19:28
  • 2
    @TomaszGandor: you could add the namespace to a separate attribute, perhaps. For simple tag containment tests (*does this document contain this tag name?*) this solution is great and can be short-circuited. – Martijn Pieters Oct 01 '19 at 15:37
  • @TraceKira: this technique removes namespaces from the parsed document, and you can't use that to create a new XML string with namespaces. Either store the namespace values in an extra attribute (and put the namespace back in before turning the XML tree back into a string) or re-parse from the original source to apply changes to that based on the stripped tree. – Martijn Pieters Oct 01 '19 at 15:40
  • @TomaszGandor Important to point out that wildcards in elementtree are only available from python 3.8 onwards. – DryLabRebel Jul 28 '23 at 06:30
46

If you remove the xmlns attribute from the xml before parsing it then there won't be a namespace prepended to each tag in the tree.

import re

xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
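A small sketch of what this substitution does and does not cover. The two XML strings and the helper name `strip_default_xmlns` are illustrative assumptions: the default-namespace case works, while a prefixed declaration such as `xmlns:t="..."` is left untouched by the pattern.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative inputs: one default namespace, one prefixed namespace
default_ns_xml = (
    '<XML_HEADER xmlns="http://www.test.com">'
    '<DEAL_LEVEL><PAID_OFF>N</PAID_OFF></DEAL_LEVEL>'
    '</XML_HEADER>'
)
prefixed_xml = (
    '<t:XML_HEADER xmlns:t="http://www.test.com">'
    '<t:DEAL_LEVEL/>'
    '</t:XML_HEADER>'
)


def strip_default_xmlns(xmlstring):
    """Remove the first default xmlns="..." declaration from the text."""
    return re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)


# Default namespace removed: plain tag names now match
root = ET.fromstring(strip_default_xmlns(default_ns_xml))
found = root.findall("DEAL_LEVEL/PAID_OFF")

# Prefixed declaration does not match the pattern, so tags stay qualified
prefixed_root = ET.fromstring(strip_default_xmlns(prefixed_xml))
print(root.tag, prefixed_root.tag)
```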
david.barkhuizen
user2212280
  • 637
  • 5
  • 3
  • 5
    This worked in many cases for me, but then I ran into multiple namespaces and namespace aliases. See my answer for another approach that handles these cases. – nonagon Sep 18 '14 at 19:38
  • 59
    -1 manipulating the xml via a regular expression before parsing is just wrong. though it might work in some cases, this should not be the top voted answer and should not be used in a professional application. – Mike Feb 15 '15 at 19:48
  • 2
    Apart from the fact that using a regex for a XML parsing job is inherently unsound, this is **not going to work for many XML documents**, because it ignores namespace prefixes, and the fact that XML syntax allows for arbitrary whitespace before attribute names (not just spaces) and around the `=` equals sign. – Martijn Pieters Oct 01 '19 at 15:31
  • Yes, it's quick and dirty, but it's definitely the most elegant solution for simple use cases, thanks! – rimkashox Jun 13 '20 at 10:14
19

The answers so far explicitly put the namespace value in the script. For a more generic solution, I would rather extract the namespace from the XML:

import re
def get_namespace(element):
  m = re.match(r'\{.*\}', element.tag)  # raw string avoids invalid-escape warnings
  return m.group(0) if m else ''

And use it in find method:

namespace = get_namespace(tree.getroot())
print(tree.find('./{0}parent/{0}version'.format(namespace)).text)
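For reference, a self-contained sketch of this idea run against a shortened copy of the question's sample XML (the `SAMPLE_XML` constant is an illustrative assumption). It assumes every element lives in the same namespace as the root.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question
SAMPLE_XML = """<XML_HEADER xmlns="http://www.test.com">
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""


def get_namespace(element):
    """Return the '{uri}' prefix of the element's tag, or '' if there is none."""
    m = re.match(r'\{.*\}', element.tag)
    return m.group(0) if m else ''


root = ET.fromstring(SAMPLE_XML)
namespace = get_namespace(root)

# Reuse the extracted prefix for every step of the path
paid_off = root.find('./{0}DEAL_LEVEL/{0}PAID_OFF'.format(namespace))
print(namespace, paid_off.text)
```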
Pierluigi Vernetto
wimous
14

Here's an extension to @nonagon answer (which removes namespace from tags) to also remove namespace from attributes:

import io
import xml.etree.ElementTree as ET

# instead of ET.fromstring(xml)
it = ET.iterparse(io.StringIO(xml))
for _, el in it:
    if '}' in el.tag:
        el.tag = el.tag.split('}', 1)[1]  # strip all namespaces
    for at in list(el.attrib.keys()): # strip namespaces of attributes too
        if '}' in at:
            newat = at.split('}', 1)[1]
            el.attrib[newat] = el.attrib[at]
            del el.attrib[at]
root = it.root

Obviously this permanently defaces the XML. If that is acceptable, because the tag names are unique without their namespaces and because you won't be writing out a file that needs the original namespaces, then this can make accessing the data a lot easier.
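A short sketch of the loop above wrapped in a function and run on a made-up document with one namespaced attribute. The function name `parse_stripped`, the `x` prefix, and its URI are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative document: default namespace plus a prefixed attribute
XML_WITH_ATTRS = (
    '<root xmlns="http://www.test.com" xmlns:x="http://www.example.com/x">'
    '<item x:id="42" plain="yes"/>'
    '</root>'
)


def parse_stripped(xml_text):
    """Parse xml_text, removing namespaces from tags and attribute names."""
    it = ET.iterparse(io.StringIO(xml_text))
    for _, el in it:
        if '}' in el.tag:
            el.tag = el.tag.split('}', 1)[1]
        # Copy the keys first because the dict is modified while looping
        for at in list(el.attrib.keys()):
            if '}' in at:
                el.attrib[at.split('}', 1)[1]] = el.attrib.pop(at)
    return it.root


root = parse_stripped(XML_WITH_ATTRS)
item = root.find("item")
print(item.attrib)
```

If two attributes differ only by namespace, the later one silently overwrites the earlier one, which is the uniqueness caveat mentioned above.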

14

Improving on the answer by ericspod:

Instead of changing the parse mode globally we can wrap this in an object supporting the with construct.

from xml.parsers import expat

class DisableXmlNamespaces:
    def __enter__(self):
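        # Remember the original factory so it can be restored on exit
        self.old_parser_create = expat.ParserCreate
        # Passing None as the separator turns off expat's namespace processing
        expat.ParserCreate = lambda encoding, sep: self.old_parser_create(encoding, None)

    def __exit__(self, type, value, traceback):
        expat.ParserCreate = self.old_parser_create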

This can then be used as follows

import xml.etree.ElementTree as ET
with DisableXmlNamespaces():
     tree = ET.parse("test.xml")

The beauty of this way is that it does not change any behaviour for unrelated code outside the with block. I ended up creating this after getting errors in unrelated libraries after using the version by ericspod which also happened to use expat.

Neuron
lijat
  • This is sweet AND healthy! Saved my day! +1 – AndreasT Dec 26 '18 at 00:42
  • 2
    In Python 3.8 (have not tested with other versions) this does not appear to work for me. Looking at the source it *should* work, but it seems the source code for `xml.etree.ElementTree.XMLParser` is somehow optimized and monkey-patching `expat` has absolutely no effect. – Reinderien May 22 '20 at 02:17
  • 2
    Ah, yeah. See @barny's comment: https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma#comment96135721_48344935 – Reinderien May 22 '20 at 02:25
6

You can use the elegant string formatting construct as well:

ns='http://www.test.com'
el2 = tree.findall("{%s}DEAL_LEVEL/{%s}PAID_OFF" %(ns,ns))

or, if you're sure that PAID_OFF appears at only one level in the tree:

el2 = tree.findall(".//{%s}PAID_OFF" % ns)
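To avoid repeating the namespace for every step, the formatting can be folded into a small helper. This is only a sketch: `qualify` is a hypothetical name, it assumes a single namespace for all steps, and it does not handle predicates such as `[@attr='x']`.

```python
import xml.etree.ElementTree as ET

NS = "http://www.test.com"

# Shortened copy of the sample document from the question
SAMPLE_XML = """<XML_HEADER xmlns="http://www.test.com">
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""


def qualify(path, ns):
    """Prefix every tag name in a simple slash-separated path with {ns}."""
    steps = []
    for step in path.split("/"):
        # Leave empty steps (from '//'), '.', '..' and '*' untouched
        if step in ("", ".", "..", "*"):
            steps.append(step)
        else:
            steps.append(f"{{{ns}}}{step}")
    return "/".join(steps)


root = ET.fromstring(SAMPLE_XML)
direct = root.findall(qualify("DEAL_LEVEL/PAID_OFF", NS))
anywhere = root.findall(qualify(".//PAID_OFF", NS))
print(qualify(".//PAID_OFF", NS))
```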
tzp
6

In Python 3.5, you can pass a namespace mapping as an argument to find() and findall(). For example:

ns= {'xml_test':'http://www.test.com'}
tree = ET.parse(r"test.xml")
el1 = tree.findall("xml_test:DEAL_LEVEL/xml_test:PAID_OFF",ns)

Documentation link: https://docs.python.org/3.5/library/xml.etree.elementtree.html#parsing-xml-with-namespaces
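A runnable sketch of the mapping approach on a shortened copy of the question's XML. The second lookup uses an empty-string prefix, which (to my knowledge) is only honoured from Python 3.8 onwards, so treat that part as version-dependent.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question
SAMPLE_XML = """<XML_HEADER xmlns="http://www.test.com">
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""

root = ET.fromstring(SAMPLE_XML)

# Explicit prefix: every step in the path carries "xml_test:"
ns = {'xml_test': 'http://www.test.com'}
with_prefix = root.findall("xml_test:DEAL_LEVEL/xml_test:PAID_OFF", ns)

# Python 3.8+ (assumption): '' as prefix applies the namespace
# to all unprefixed tag names in the path
default_ns = {'': 'http://www.test.com'}
without_prefix = root.findall("DEAL_LEVEL/PAID_OFF", default_ns)

print(len(with_prefix), len(without_prefix))
```

The empty-prefix form comes closest to "ignoring" the namespace while leaving the parsed tree untouched.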

karthik prasanna
5

I might be late to this, but I don't think re.sub is a good solution.

However, replacing xml.parsers.expat.ParserCreate does not work on Python 3.x.

The main culprit is xml/etree/ElementTree.py; see the bottom of its source code:

# Import the C accelerators
try:
    # Element is going to be shadowed by the C implementation. We need to keep
    # the Python version of it accessible for some "creative" by external code
    # (see tests)
    _Element_Py = Element

    # Element, SubElement, ParseError, TreeBuilder, XMLParser
    from _elementtree import *
except ImportError:
    pass

Which is kinda sad.

The solution is to get rid of it first.

import _elementtree
try:
    del _elementtree.XMLParser
except AttributeError:
    # in case deleted twice
    pass
else:
    from xml.parsers import expat  # NOQA: F811
    oldcreate = expat.ParserCreate
    expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

Tested on Python 3.6.

The try statement is useful in case your code reloads or imports a module twice somewhere; without it you get strange errors such as:

  • maximum recursion depth exceeded
  • AttributeError: XMLParser

btw damn the etree source code looks really messy.

est
4

If you're using ElementTree and not cElementTree you can force Expat to ignore namespace processing by replacing ParserCreate():

from xml.parsers import expat
oldcreate = expat.ParserCreate
expat.ParserCreate = lambda encoding, sep: oldcreate(encoding, None)

ElementTree tries to use Expat by calling ParserCreate() but provides no option to omit the namespace separator string. The above code causes the separator to be ignored, but be warned that this could break other things.

ericspod
  • This is a better way than other current answers as it does not depend on string processing – lijat Dec 11 '18 at 12:42
  • 3
    In Python 3.7.2 (and possibly earlier) AFAICT it's no longer possible to avoid using cElementTree, so this workaround may not be possible :-( – DisappointedByUnaccountableMod Feb 13 '19 at 14:52
  • 1
    cElemTree is deprecated but there is [shadowing of types being done with C accelerators](https://github.com/python/cpython/blob/master/Lib/xml/etree/ElementTree.py#L1630). The C code isn't calling into expat so yes this solution is broken. – ericspod Feb 19 '19 at 14:31
  • @barny it's still possible, `ElementTree.fromstring(s, parser=None)` I am trying to pass parser to it. – est Mar 20 '19 at 12:24
  • This worked for me. I was struggling to ignore the namespaces and didn't want to use lxml. I tried many options and finally this made my day. Thank you @ericspod – dkoder Dec 14 '21 at 11:03
2

Let's combine nonagon's answer with mzjn's answer to a related question:

from pathlib import Path
from typing import Dict, Tuple
import xml.etree.ElementTree as ET

def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(prefix_namespace_pair for _, prefix_namespace_pair in xml_iter)
    return xml_iter.root, xml_namespaces

Using this function we:

  1. Create an iterator to get both namespaces and a parsed tree object.

  2. Iterate over the created iterator to get the namespaces dict that we can later pass in each find() or findall() call, as suggested by iMom0.

  3. Return the parsed tree's root element object and namespaces.

I think this is the best approach all around, as it involves no manipulation of either the source XML or the parsed xml.etree.ElementTree output.

I'd also like to credit balmy's answer with providing an essential piece of this puzzle (that you can get the parsed root from the iterator). Until then I traversed the XML tree twice in my application (once to get the namespaces, and a second time for the root).
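A usage sketch for the function above, written against a temporary copy of the question's sample. The temporary file and the `d` prefix are illustrative assumptions: a default namespace is reported with an empty prefix, so the sketch remaps it to a named prefix that works in path expressions on any Python 3 version.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, Tuple

# Shortened copy of the sample document from the question
SAMPLE_XML = """<XML_HEADER xmlns="http://www.test.com">
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""


def parse_xml(xml_path: Path) -> Tuple[ET.Element, Dict[str, str]]:
    """Return the root element and a prefix -> namespace URI mapping."""
    xml_iter = ET.iterparse(xml_path, events=["start-ns"])
    xml_namespaces = dict(pair for _, pair in xml_iter)
    return xml_iter.root, xml_namespaces


with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = Path(tmp_dir) / "test.xml"
    xml_path.write_text(SAMPLE_XML, encoding="utf-8")
    root, namespaces = parse_xml(xml_path)

# The default namespace comes back under the empty prefix '';
# give it a hypothetical prefix "d" so it can be used in paths
prefixes = {prefix or "d": uri for prefix, uri in namespaces.items()}
paid_off = root.findall("d:DEAL_LEVEL/d:PAID_OFF", prefixes)
print(namespaces, [el.text for el in paid_off])
```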

z33k
  • found out how to use it, but it doesn't work for me, I still see the namespaces in the output – taiko Feb 14 '20 at 18:46
  • 1
    Look at [iMom0's comment to OP's question](https://stackoverflow.com/questions/13412496/python-elementtree-module-how-to-ignore-the-namespace-of-xml-files-to-locate-ma/57474364?noredirect=1#comment18328039_13412496). Using this function you get both the parsed object and the means to query it with `find()` and `findall()`. You just feed those methods with the namespaces's dict from `parse_xml()` and use **namespace's prefix** in your queries. Eg: `et_element.findall(".//some_ns_prefix:some_xml_tag", namespaces=xml_namespaces)` – z33k Feb 17 '20 at 08:53
  • Does this really answer the OP "it's very inconvenient to add a namespace in front of each tag"? – DisappointedByUnaccountableMod May 12 '21 at 17:02
1

Since Python 3.8, xml.etree.ElementTree lets you query nodes with a wildcard namespace.

{namespace}* selects all tags in the given namespace, {*}spam selects tags named spam in any (or no) namespace, and {}* only selects tags that are not in a namespace.

So it would be:

tree.findall('.//{*}DEAL_LEVEL')
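A minimal sketch of the wildcard syntax against a shortened copy of the question's XML, assuming Python 3.8 or later. The tree itself is left unchanged, so the namespace is still present in each element's tag.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question
SAMPLE_XML = """<XML_HEADER xmlns="http://www.test.com">
    <DEAL_LEVEL>
        <PAID_OFF>N</PAID_OFF>
    </DEAL_LEVEL>
</XML_HEADER>"""

root = ET.fromstring(SAMPLE_XML)

# {*} matches the tag name in any namespace (or none); Python 3.8+
anywhere = root.findall('.//{*}PAID_OFF')
by_path = root.findall('{*}DEAL_LEVEL/{*}PAID_OFF')

print([el.text for el in anywhere])
print(by_path[0].tag)  # the namespace is still part of the tag
```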
Jeffrey C
0

To ignore the default namespace in the root node, feed a patched root-node start tag to the parser, and then continue parsing the original XML stream.

For example, instead of <XML_HEADER xmlns="http://www.test.com">, feed <XML_HEADER> to the parser.

Limitation: only the default namespace can be ignored. When the document contains namespace-prefixed nodes like <some-ns:some-name>, lxml will throw lxml.etree.XMLSyntaxError: Namespace prefix some-ns on some-name is not defined.

Limitation: currently, this ignores the original encoding from <?xml encoding="..."?>.

#! /usr/bin/env python3

import lxml.etree
import io



def parse_xml_stream(xml_stream, ignore_default_ns=True):
    """
    ignore_default_ns:
    ignore the default namespace of the root node.

    by default, lxml.etree.iterparse
    returns the namespace in every element.tag.

    with ignore_default_ns=True,
    element.tag returns only the element's localname,
    without the namespace.

    example:
    xml_string:
        <html xmlns="http://www.w3.org/1999/xhtml">
            <div>hello</div>
        </html>
    with ignore_default_ns=False:
        element.tag = "{http://www.w3.org/1999/xhtml}div"
    with ignore_default_ns=True:
        element.tag = "div"

    see also:
    Python ElementTree module: How to ignore the namespace of XML files
    https://stackoverflow.com/a/76601149/10440128
    """

    # save the original read method
    xml_stream_read = xml_stream.read

    if ignore_default_ns:
        def xml_stream_read_track(_size):
            # ignore size, always return 1 byte
            # so we can track node positions
            return xml_stream_read(1)
        xml_stream.read = xml_stream_read_track

    def get_parser(stream):
        return lxml.etree.iterparse(
            stream,
            events=('start', 'end'),
            remove_blank_text=True,
            huge_tree=True,
        )

    if ignore_default_ns:
        # parser 1
        parser = get_parser(xml_stream)

        # parse start of root node
        event, element = next(parser)
        #print(xml_stream.tell(), event, element)
        # get name of root node
        root_name = element.tag.split("}")[-1]
        #print("root name", root_name)
        #print("root pos", xml_stream.tell()) # end of start-tag
        # attributes with namespaces
        #print("root attrib", element.attrib)

        # patched document header without namespaces
        xml_stream_nons = io.BytesIO(b"\n".join([
            #b"""<?xml version="1.0" encoding="utf-8"?>""",
            b"<" + root_name.encode("utf8") + b"><dummy/>",
        ]))
        xml_stream.read = xml_stream_nons.read

    # parser 2
    parser = get_parser(xml_stream)

    # parse start of root node
    # note: if you only need "end" events,
    # then wait for end of dummy node
    event, element = next(parser)
    print(event, element.tag)
    assert event == "start"

    if ignore_default_ns:
        assert element.tag == root_name

        # parse start of dummy node
        event, element = next(parser)
        #print(event, element.tag)
        assert event == "start"
        assert element.tag == "dummy"

        # parse end of dummy node
        event, element = next(parser)
        #print(event, element.tag)
        assert event == "end"
        assert element.tag == "dummy"

        # restore the original read method
        xml_stream.read = xml_stream_read

        # now all elements come without namespace
        # so element.tag is the element's localname
        #print("---")

    # TODO handle events

    #for i in range(5):
    #    event, element = next(parser)
    #    print(event, element)

    for event, element in parser:
        print(event, element.tag)



# xml with namespace in root node
xml_bytes = b"""\
<?xml version="1.0" encoding="utf-8"?>
<doc version="1" xmlns="http://www.test.com">
    <node/>
    <!--
        limitation: this breaks the parser.
        lxml.etree.XMLSyntaxError:
        Namespace prefix some-ns on some-name is not defined
        <some-ns:some-name/>
    -->
</doc>
"""

print("# keep default namespace")
parse_xml_stream(io.BytesIO(xml_bytes), False)

print()

print("# ignore default namespace")
parse_xml_stream(io.BytesIO(xml_bytes))

Output of print(event, element.tag):

# keep default namespace
start {http://www.test.com}doc
start {http://www.test.com}node
end {http://www.test.com}node
end {http://www.test.com}doc

# ignore default namespace
start doc
start node
end node
end doc
milahu
-2

I came across this answer by chance: XSD conditional type assignment default type confusion?. It is not an exact answer to the question here, but it may be applicable if the namespace is not critical.

<?xml version="1.0" encoding="UTF-8"?>
<persons xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="test.xsd">
    <person version="1">
        <firstname>toto</firstname>
        <lastname>tutu</lastname>
    </person>
</persons>

Also see: https://www.w3.org/TR/xmlschema-1/#xsi_schemaLocation

Works for me. I call an XML validation procedure in my application, but I also want to quickly see validation highlighting and autocompletion in PyCharm when editing the XML. The noNamespaceSchemaLocation attribute does what I need.

RECHECKED

from xml.etree import ElementTree as ET
tree = ET.parse("test.xml")
el1 = tree.findall("person/firstname")
print(el1[0].text)
el2 = tree.find("person/lastname")
print(el2.text)

Returns

>python test.py
toto
tutu
Nick Legend
  • 1
    This does not solve the problem. It looks like an answer to a different question. – mzjn Feb 08 '21 at 15:14
  • @mzjn Double checked. Please tell if it's not what you would expect. – Nick Legend Feb 12 '21 at 16:22
  • What does your XML or code have to do with the question? The elements in your XML are not bound to any namespace. The XML in the question has no `firstname` or `lastname` elements. – mzjn Feb 12 '21 at 16:38
  • Agree, thanks for the clarification :) Anyway think the answer is useful for those (including me) who don't mind unbinding the namespace in order to resolve the issue. – Nick Legend Feb 12 '21 at 16:53