copying input xml file and write exactly with Python

Question

Input xml file:

<?xml version="1.0"?>
<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
  <mode>PRESSURE_CONTROL</mode>
  <category>ADULT</category>
  <testcase id="1" type="UNIQUE">
    <parameter id="PEEP" value="1.0">true</parameter>
    <parameter id="CMV_FREQ" value="4.0">true</parameter>
    <parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
    <parameter id="I_E_RATIO" value="0.1">false</parameter>
  </testcase>
</res:testcases>

Python Code:

import xml.etree.ElementTree as ET

tree = ET.parse('/home/AlAhAb65/Desktop/input.xml')    
root = tree.getroot() 

root.attrib['type'] = 'AVA'

tree.write('/home/AlAhAb65/Desktop/output1.xml')

Output xml file:

<ns0:testcases id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="AVA" xmlns:ns0="urn:testcases">
  <mode>PRESSURE_CONTROL</mode>
  <category>ADULT</category>
  <testcase id="1" type="UNIQUE">
    <parameter id="PEEP" value="1.0">true</parameter>
    <parameter id="CMV_FREQ" value="4.0">true</parameter>
    <parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
    <parameter id="I_E_RATIO" value="0.1">false</parameter>
  </testcase>
</ns0:testcases>

The problem is that when I copy the input and write the output XML file, three unexpected things happen:

1. The first line of the input XML file (the XML declaration) is removed automatically.
2. In the second line of the input, the prefix 'res' is replaced with 'ns0'. The same happens in the closing tag.
3. The order of the attributes in the second line of the input is changed.

I want the output to be an exact copy of the XML file that I got as input. Please help me in this regard.

Most XML libraries can only promise to round-trip documents faithfully if they start out in C14N (http://www.w3.org/TR/xml-c14n). What you ask is probably impossible for arbitrary input; make your inputs C14N-compliant and use lxml's C14N output tools, and then you'll be fine. — Charles Duffy, Jul 26 '13 at 15:11

Charles Duffy · Answer 1 · 2013-07-29T14:19:28.443

5

The W3C has defined a Canonical XML (C14N) standard. Documents written in this format can be faithfully round-tripped by any C14N-compliant toolchain.

In the case of lxml.etree (a more capable implementation of the ElementTree API with C14N support), this means that you need to do two things:

1. Convert your original input document into C14N form.
2. Use the ElementTree.write_c14n() call to generate your output document.

A C14N-form version of your input file will look like so (generated by the xmlstarlet c14n command):

<res:testcases xmlns:res="urn:testcases" id="a1e4bfdb-40a2-485c-a1ac-54d220056dd5" type="MODEL">
  <mode>PRESSURE_CONTROL</mode>
  <category>ADULT</category>
  <testcase id="1" type="UNIQUE">
    <parameter id="PEEP" value="1.0">true</parameter>
    <parameter id="CMV_FREQ" value="4.0">true</parameter>
    <parameter id="PRESS_ABOVE_PEEP" value="0.0">true</parameter>
    <parameter id="I_E_RATIO" value="0.1">false</parameter>
  </testcase>
</res:testcases>

...and an appropriately modified version of your code:

#!/usr/bin/env python

import lxml.etree

tree = lxml.etree.parse('input.xml')    
root = tree.getroot() 

root.attrib['type'] = 'AVA'

tree.write_c14n('output1.xml')
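
For readers without lxml, the same idea can be sketched with the standard library. This is an assumption-laden illustration, not part of the original answer: `xml.etree.ElementTree.canonicalize()` is only available from Python 3.8 and implements C14N 2.0, so its output may not be byte-identical to what lxml's `write_c14n()` produces. The inline `SOURCE` string is an abbreviated stand-in for the question's input file.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's input, used here as inline sample data.
# The attribute order is deliberately different from the canonical one.
SOURCE = (
    '<?xml version="1.0"?>\n'
    '<res:testcases xmlns:res="urn:testcases" type="MODEL" id="a1">\n'
    '  <mode>PRESSURE_CONTROL</mode>\n'
    '</res:testcases>\n'
)

# canonicalize() returns the canonical form as a string when no
# output file is given.
canonical = ET.canonicalize(SOURCE)

# Canonicalizing an already-canonical document should change nothing,
# which is the property that makes byte-for-byte round trips possible.
again = ET.canonicalize(canonical)

print(canonical)
print("stable under re-canonicalization:", canonical == again)
```

The canonical form drops the XML declaration and puts attributes into a fixed order, which is why the declaration cannot survive a C14N round trip.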

If you add an XML declaration (the <?xml version="1.0"?> line), your output will be noncompliant with the C14N standard. As such, this is something you absolutely should not do. If you really, really want to do this wrongheaded thing...

Don't.

But if you must, you'd do it like so:

# write_c14n() emits bytes, so open the file in binary mode
with open('output1.xml', 'wb') as outfile:
    outfile.write(b'<?xml version="1.0"?>\n')
    tree.write_c14n(outfile)

edited Jul 29 '13 at 14:19

answered Jul 26 '13 at 15:16

Charles Duffy

This solves the 2nd line issue. But it does not solve adding the `<?xml version="1.0"?>` line. – ahadcse Jul 29 '13 at 10:32
@ahadcse Step 1 does in fact solve that, because it takes the `<?xml version="1.0"?>` away from your original version. Yes, you need to modify your original to be in C14N. No, there isn't a way to get around that -- otherwise you're relying on luck for the round-trip to work. (If you don't use C14N, there aren't guarantees about what order attributes are printed in, how empty elements are serialized, or many other things). – Charles Duffy Jul 29 '13 at 12:07
Is there any way to just manually write this line? I tried, but it overwrites the next line. In fact, I want to get the exact file as output, so any way of writing it will work for me. Otherwise, if I need to convert it to C14N form, how can I do it in my Python code? – ahadcse Jul 29 '13 at 12:30
@ahadcse Again -- just adding the line might get you the exact output you want for this specific case, but it's almost certainly not going to do what you want for the full range of valid inputs. The Right Thing (which is to say the standards-body approved, guaranteed-to-work-or-someone's-buggy approach to byte-for-byte round-tripping of XML) is C14N, and if you're using C14N, that line won't exist in your inputs either. In terms of how you do that in your Python code... well, `tree = lxml.etree.parse('input.xml'); tree.write_c14n('input.xml')` is doing just that. – Charles Duffy Jul 29 '13 at 14:11
@ahadcse Part of the point here is that if someone else asked you to write code to modify an XML document in-place, and they didn't give you that input document in C14N form, *that's their bug, not yours*. Report it to them. – Charles Duffy Jul 29 '13 at 14:12
@ahadcse I added an update on how to do the literal thing you asked for, in addition to an explanation of why it's standards-noncompliant and wrong. – Charles Duffy Jul 29 '13 at 14:20

score 2 · Answer 2 · edited May 23 '17 at 12:06

From the documentation page, the XML declaration can be added like this:

tree.write('/home/AlAhAb65/Desktop/output1.xml', xml_declaration=True)

You should also add the encoding because the default one is us-ascii:

tree.write('/home/AlAhAb65/Desktop/output1.xml', encoding='utf-8', xml_declaration=True)

Or you can retrieve the encoding from the original file, but in any case you will get a different XML declaration, probably something like this:

<?xml version="1.0" encoding="UTF-8"?>

Or you can manually add the XML declaration. Anyway a slight declaration mismatch should not be a problem for any robust XML parser as long as the declared encoding is coherent with the real encoding.

Attribute order is not significant in XML, so the information is probably lost when the file is parsed within the API. There is probably no simple way to make this work when processing the file through the standard ElementTree API. You would probably be better off using lxml's C14N support if you want to make minor changes to the file.
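
To illustrate why these differences do not matter to a parser, here is a minimal sketch using the standard library. The two strings are sample data modelled on the question's root element (with a shortened, made-up `id`), differing only in prefix and attribute order.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ in prefix and attribute order only
# (sample data modelled on the question's root element).
original = '<res:testcases xmlns:res="urn:testcases" id="a1" type="MODEL"/>'
rewritten = '<ns0:testcases type="MODEL" id="a1" xmlns:ns0="urn:testcases"/>'

a = ET.fromstring(original)
b = ET.fromstring(rewritten)

# ElementTree stores the tag as "{namespace-uri}localname", so the
# prefix chosen in the file does not take part in the comparison.
print(a.tag == b.tag)

# Attributes are held in a dict, and dict equality ignores order.
print(a.attrib == b.attrib)
```

Both comparisons treat the documents as the same data, even though the files differ byte for byte.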

The namespace prefixes are changed by default in ElementTree. To prevent this behavior, you can switch to lxml which seems to preserve namespace prefixes by default:

Because etree is built on top of libxml2, which is namespace prefix aware, etree preserves namespaces declarations and prefixes while ElementTree tends to come up with its own prefixes (ns0, ns1, etc). When no namespace prefix is given, however, etree creates ElementTree style prefixes as well.

Switching to lxml is a good idea in any case, but the changes you observe should not be a problem if the program reading the file at the other end is sufficiently XML-compliant. Unfortunately, a lot of XPath processors have issues with namespace prefix changes...
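
If switching libraries is not an option, the standard ElementTree module can be told which prefix to use for a known namespace via `ET.register_namespace()`. The following is a sketch under assumptions: it needs a Python version that has both `register_namespace()` and the `xml_declaration` argument, it uses an abbreviated inline copy of the question's input, and it writes to an in-memory buffer instead of the original file paths. It keeps the `res` prefix and emits a declaration, but it does not promise a byte-for-byte copy of the input.

```python
import io
import xml.etree.ElementTree as ET

# Abbreviated copy of the question's input, used as inline sample data.
SOURCE = (
    '<?xml version="1.0"?>\n'
    '<res:testcases xmlns:res="urn:testcases" id="a1" type="MODEL">\n'
    '  <mode>PRESSURE_CONTROL</mode>\n'
    '</res:testcases>\n'
)

# Map the namespace URI to the prefix used in the input. The mapping is
# global to the process, so register it before serializing.
ET.register_namespace("res", "urn:testcases")

tree = ET.ElementTree(ET.fromstring(SOURCE))
tree.getroot().set("type", "AVA")

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.BytesIO()
tree.write(buffer, encoding="utf-8", xml_declaration=True)
output = buffer.getvalue().decode("utf-8")
print(output)
```

The declaration that ElementTree writes includes an encoding and so generally differs from the one in the input, which matches the caveat above about a slight declaration mismatch.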

write() got an unexpected keyword argument 'xml_declaration'. — ahadcse, Jul 29 '13 at 10:13
The 'xml_declaration' parameter was added in Python 2.7, so if you use any earlier version there's no way to force this API to output the declaration. You really should take a look at libxml... — Maxime Rossini, Jul 29 '13 at 15:12
