17

I have a document which uses an XML namespace for which I want to increase /group/house/dogs by one: (the file is called houses.xml)

<?xml version="1.0"?>
<group xmlns="http://dogs.house.local">
    <house>
            <id>2821</id>
            <dogs>2</dogs>
    </house>
</group>

My current result using the code below is: (the created file is called houses2.xml)

<ns0:group xmlns:ns0="http://dogs.house.local">
    <ns0:house>
        <ns0:id>2821</ns0:id>
        <ns0:dogs>3</ns0:dogs>
    </ns0:house>
</ns0:group>

I would like to fix two things (if it is possible using ElementTree. If it isn´t, I´d be greatful for a suggestion as to what I should use instead):

  1. I want to keep the <?xml version="1.0"?> line.
  2. I do not want to prefix all tags, I´d like to keep it as is.

In conclusion, I don´t want to mess with the document more than I absolutely have to.

My current code (which works except for the above mentioned flaws) generating the above result follows.

I have made a utility function which loads an XML file using ElementTree and returns the elementTree and the namespace (as I do not want to hard code the namespace, and am willing to take the risk it implies):

def elementTreeRootAndNamespace(xml_file):
    from xml.etree import ElementTree
    import re
    element_tree = ElementTree.parse(xml_file)

    # Search for a namespace on the root tag
    namespace_search = re.search('^({\S+})', element_tree.getroot().tag)
    # Keep the namespace empty if none exists, if a namespace exists set
    # namespace to {namespacename}
    namespace = ''
    if namespace_search:
        namespace = namespace_search.group(1)

    return element_tree, namespace

This is my code to update the number of dogs and save it to the new file houses2.xml:

elementTree, namespace = elementTreeRootAndNamespace('houses.xml')

# Insert the namespace before each tag when when finding current number of dogs,
# as ElementTree requires the namespace to be prefixed within {...} when a
# namespace is used in the document.
dogs = elementTree.find('{ns}house/{ns}dogs'.format(ns = namespace))

# Increase the number of dogs by one
dogs.text = str(int(dogs.text) + 1)

# Write the result to the new file houses2.xml.
elementTree.write('houses2.xml')
Deleted
  • 4,067
  • 6
  • 33
  • 51
  • I guess this is a duplicate of http://stackoverflow.com/questions/8983041/saving-xml-files-using-elementtree and http://stackoverflow.com/questions/18338807/cannot-write-xml-file-with-default-namespace – leo Nov 02 '13 at 20:15

4 Answers4

3

An XML based solution to this problem is to write a helper class for ElementTree which:

  • Grabs the XML-declaration line before parsing as ElementTree at the time of writing is unable to write an XML-declaration line without also writing an encoding attribute(I checked the source).
  • Parses the input file once, grabs the namespace of the root element. Registers that namespace with ElementTree as having the empty string as prefix. When that is done the source file is parsed using ElementTree again, with that new setting.

It has one major drawback:

  • XML-comments are lost. Which I have learned is not acceptable for this situation(I initially didn´t think the input data had any comments, but it turns out it has).

My helper class with example:

from xml.etree import ElementTree as ET
import re


class ElementTreeHelper():
    def __init__(self, xml_file_name):
        xml_file = open(xml_file_name, "rb")

        self.__parse_xml_declaration(xml_file)

        self.element_tree = ET.parse(xml_file)
        xml_file.seek(0)

        root_tag_namespace = self.__root_tag_namespace(self.element_tree)
        self.namespace = None
        if root_tag_namespace is not None:
            self.namespace = '{' + root_tag_namespace + '}'
            # Register the root tag namespace as having an empty prefix, as
            # this has to be done before parsing xml_file we re-parse.
            ET.register_namespace('', root_tag_namespace)
            self.element_tree = ET.parse(xml_file)

    def find(self, xpath_query):
        return self.element_tree.find(xpath_query)

    def write(self, xml_file_name):
        xml_file = open(xml_file_name, "wb")
        if self.xml_declaration_line is not None:
            xml_file.write(self.xml_declaration_line + '\n')

        return self.element_tree.write(xml_file)

    def __parse_xml_declaration(self, xml_file):
        first_line = xml_file.readline().strip()
        if first_line.startswith('<?xml') and first_line.endswith('?>'):
            self.xml_declaration_line = first_line
        else:
            self.xml_declaration_line = None
        xml_file.seek(0)

    def __root_tag_namespace(self, element_tree):
        namespace_search = re.search('^{(\S+)}', element_tree.getroot().tag)
        if namespace_search is not None:
            return namespace_search.group(1)
        else:
            return None


def __main():
    el_tree_hlp = ElementTreeHelper('houses.xml')

    dogs_tag = el_tree_hlp.element_tree.getroot().find(
                   '{ns}house/{ns}dogs'.format(
                         ns=el_tree_hlp.namespace))
    one_dog_added = int(dogs_tag.text.strip()) + 1
    dogs_tag.text = str(one_dog_added)

    el_tree_hlp.write('hejsan.xml')

if __name__ == '__main__':
    __main()

The output:

<?xml version="1.0"?>
<group xmlns="http://dogs.house.local">
    <house>
            <id>2821</id>
            <dogs>3</dogs>
    </house>
</group>

If someone has an improvement to this solution please don´t hesitate to grab the code and improve it.

Deleted
  • 4,067
  • 6
  • 33
  • 51
  • I´ll wait a week or so in case someone else (or me) has an improvement or a better solution all-together. – Deleted Mar 08 '12 at 11:53
  • Have you tried to add the namespace with an empty prefix beforehand? or do you have many namespaces and you can't say in advance which one it will be? – Aaron Digulla Mar 08 '12 at 13:17
  • 1
    I have many namespaces. I know it incurs a risk to deal with them in the same way, but for my data I think it´s safe enough. That´s why I parse the file once, grab the namespace and parse again. I could have parsed the namespace myself by file io, then thefile.seek(0) and let ElementTree do its thing on thefile. But I wanted the same logic to parse the namespace part both times. – Deleted Mar 09 '12 at 09:19
  • I´m using the hacky solution in the accepted answer. I´m not fully comfortable with it but it´ll have to do. – Deleted Mar 09 '12 at 09:19
  • The answer below has helped me. Abeltang – Zeus Oct 06 '15 at 17:39
  • Your code works great for my app, I'm considering make it as an xml helper class for future projects. I think there might be some improvements to make it more robust, like, use `with open('xml_file_name') as xml_file:` to replace `xml_file = open(xml_file_name, "rb")`, Check if the designated xml file existed before read & write, etc. – shi jing Sep 07 '21 at 02:15
1

when Save xml add default_namespace argument is easy to avoid ns0, on my code

key code: xmltree.write(xmlfiile,"utf-8",default_namespace=xmlnamespace)

if os.path.isfile(xmlfiile):
            xmltree = ET.parse(xmlfiile)
            root = xmltree.getroot()
            xmlnamespace = root.tag.split('{')[1].split('}')[0]  //get namespace

            initwin=xmltree.find("./{"+ xmlnamespace +"}test")
            initwin.find("./{"+ xmlnamespace +"}content").text = "aaa"
            xmltree.write(xmlfiile,"utf-8",default_namespace=xmlnamespace)
Abeltang
  • 21
  • 1
1

etree from lxml provides this feature.

  1. elementTree.write('houses2.xml',encoding = "UTF-8",xml_declaration = True) helps you in not omitting the declaration

  2. While writing into the file it does not change the namespaces.

http://lxml.de/parsing.html is the link for its tutorial.

P.S : lxml should be installed separately.

Racil Hilan
  • 24,690
  • 13
  • 50
  • 55
Anon
  • 129
  • 1
  • 9
1

Round-tripping, unfortunately, isn't a trivial problem. With XML, it's generally not possible to preserve the original document unless you use a special parser (like DecentXML but that's for Java).

Depending on your needs, you have the following options:

  • If you control the source and you can secure your code with unit tests, you can write your own, simple parser. This parser doesn't accept XML but only a limited subset. You can, for example, read the whole document as a string and then use Python's string operations to locate <dogs> and replace anything up to the next <. Hack? Yes.

  • You can filter the output. XML allows the string <ns0: only in one place, so you can search&replace it with < and then the same with <group xmlns:ns0="<group xmlns=". This is pretty safe unless you can have CDATA in your XML.

  • You can write your own, simple XML parser. Read the input as a string and then create Elements for each pair of <> plus their positions in the input. That allows you to take the input apart quickly but only works for small inputs.

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • Thanks. Yes, it takes longer than you'd think for something as simple as XML... ;-) – Aaron Digulla Mar 09 '12 at 09:24
  • As I´ve found no nice XML-based parser which is able to do this I´m marking this answer as correct. I´m continuing to use the solution you proposed at the first list item (which was already present in the code I am converting to Python). To get what I want I would more or less have to write a custom parser, which would take too long for the task I´m solving (albeit it would be fun). Just removing – Deleted Mar 09 '12 at 13:28