I have tried to use the answer in this question, but can't make it work: How to create "virtual root" with Python's ElementTree?

Here's my code:

import xml.etree.cElementTree as ElementTree
from StringIO import StringIO
s = '<?xml version=\"1.0\" encoding=\"UTF-8\" ?><!DOCTYPE tmx SYSTEM \"tmx14a.dtd\" ><tmx version=\"1.4a\" />'
tree = ElementTree.parse(StringIO(s)).getroot()
header = ElementTree.SubElement(tree,'header',{'adminlang': 'EN',})
body = ElementTree.SubElement(tree,'body')
ElementTree.ElementTree(tree).write('myfile.tmx','UTF-8')

When I open the resulting 'myfile.tmx' file, the DOCTYPE declaration is missing. It contains only this:

<?xml version='1.0' encoding='UTF-8'?>
<tmx version="1.4a"><header adminlang="EN" /><body /></tmx>

What am I missing? Or is there a better tool?

tahoar

4 Answers

Pass xml_declaration=False to the write function so that ElementTree does not emit its own XML declaration, then write the declaration and DOCTYPE you need yourself, before the tree. In fact, if you pass the encoding as 'utf-8' (lowercase), the XML declaration is not added either.

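import xml.etree.ElementTree as ElementTree  # cElementTree was removed in Python 3.9

tree = ElementTree.Element('tmx', {'version': '1.4a'})
ElementTree.SubElement(tree, 'header', {'adminlang': 'EN'})
ElementTree.SubElement(tree, 'body')

# Open in binary mode: both the header and the serialized tree are written as bytes.
with open('myfile.tmx', 'wb') as f:
    # Write the XML declaration and DOCTYPE by hand...
    f.write('<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE tmx SYSTEM "tmx14a.dtd">'.encode('utf8'))
    # ...then the tree itself, without ElementTree's own declaration.
    ElementTree.ElementTree(tree).write(f, 'utf-8', xml_declaration=False)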

Resulting file (newlines added manually for readability):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
    <header adminlang="EN" />
    <body />
</tmx>
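
If you would rather not add the newlines by hand, here is a minimal sketch of the same idea that puts "\n" after the declaration and the DOCTYPE and lets ElementTree indent the tree. The names write_with_doctype and HEADER are made up for this illustration, and ET.indent only exists in Python 3.9 and later, so the sketch skips indentation on older versions. It writes to an in-memory buffer so the result can be inspected; a file opened with 'wb' should work the same way.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical header for illustration: declaration and DOCTYPE, one per line.
HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
          '<!DOCTYPE tmx SYSTEM "tmx14a.dtd">\n')


def write_with_doctype(root, stream, header=HEADER):
    """Write the header, then the serialized tree, to a binary stream."""
    if hasattr(ET, "indent"):  # ET.indent was added in Python 3.9
        ET.indent(root, space="    ")
    stream.write(header.encode("utf-8"))
    # xml_declaration=False: the declaration is already part of the header.
    ET.ElementTree(root).write(stream, encoding="utf-8", xml_declaration=False)


root = ET.Element("tmx", {"version": "1.4a"})
ET.SubElement(root, "header", {"adminlang": "EN"})
ET.SubElement(root, "body")

buffer = io.BytesIO()
write_with_doctype(root, buffer)
result = buffer.getvalue().decode("utf-8")
print(result)
```

Note that ET.indent modifies the tree in place (it inserts whitespace text nodes), so call it last if you still need to process the elements.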
demalexx
  • Can you explain how you added the new lines to the XML? – Learner May 25 '16 at 05:11
  • @Learner: I added it manually for readability. If you want to have XML with new lines from ElementTree - search how to pretty print XML. – demalexx May 27 '16 at 13:47
  • This gives me an error `TypeError: write() argument must be str, not bytes` in python 3.6.4 on macOS. I think it's because you are writing first as a string, then as binary in the same open() command. – Elliott B Mar 28 '18 at 09:18
  • @ElliottB thanks, I updated code. Should work on both python 2 and 3. – demalexx Mar 29 '18 at 15:59
  • This solution doesn't work, except if you enter manually (as said) the ElementTree which is surely not what you want to do. I put a simple & stupid solution to this problem below. – Emilio Conte Mar 26 '19 at 08:40
  • @Learner You could simply insert a "\n" (without quotes) into the string between the XML declaration and the doctype. – posfan12 Jul 04 '19 at 12:56

You could use lxml and its tostring function:

from lxml import etree

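# Bytes input: lxml rejects str input that carries an encoding declaration (Python 3).
s = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4a"/>"""

tree = etree.fromstring(s)
header = etree.SubElement(tree, 'header', {'adminlang': 'EN'})
body = etree.SubElement(tree, 'body')

# tostring returns bytes here, so decode before printing.
print(etree.tostring(tree, encoding="UTF-8",
                     xml_declaration=True,
                     pretty_print=True,
                     doctype='<!DOCTYPE tmx SYSTEM "tmx14a.dtd">').decode("UTF-8"))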

=>

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
  <header adminlang="EN"/>
  <body/>
</tmx>
mzjn
  • I get this error: `ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.` with Python 3.6 – nowox Jun 28 '18 at 09:44
  • `etree.fromstring(s.encode("UTF-8"))` works for me with Python 3.6. – mzjn Jun 28 '18 at 11:21

I used a different solution to add the DOCTYPE: very simple, very stupid.

import xml.etree.ElementTree as ET

# Assumes `root` (the root Element) and `path_file` (the output path) are defined elsewhere.
doc_type = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE dlg:window ' \
           'PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "dialog.dtd">'

with open(path_file, "w", encoding='UTF-8') as xf:
    # encoding='unicode' makes tostring return str, so no decode step is needed.
    body = ET.tostring(root, encoding='unicode')
    xf.write(f"{doc_type}{body}")
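
One thing to watch with a prefixed DOCTYPE name such as dlg:window: for the document to validate, the name in the DOCTYPE has to match the root element's name as it is serialized, and ElementTree invents prefixes like ns0 unless you register your own. Below is a rough sketch of that, building the document in memory. The namespace URI in DLG_NS is an assumption made for illustration, so substitute the one your DTD actually expects.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI, for illustration only.
DLG_NS = "http://openoffice.org/2000/dialog"

# Without this call ElementTree would serialize the root as <ns0:window ...>.
ET.register_namespace("dlg", DLG_NS)

DOC_TYPE = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<!DOCTYPE dlg:window PUBLIC '
            '"-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "dialog.dtd">\n')

root = ET.Element("{%s}window" % DLG_NS)

# encoding="unicode" returns str and adds no XML declaration of its own.
body = ET.tostring(root, encoding="unicode")
document = DOC_TYPE + body
print(document)
```

The resulting string can then be written to a file opened in text mode with encoding='UTF-8', as in the snippet above.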
Emilio Conte

I couldn't find a solution to this problem using vanilla ElementTree either, and the solution proposed by demalexx produced invalid XML that my application (DITA) rejected. What I propose instead is a workaround that post-processes the serialized XML with the re module; it works perfectly for me.

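import re

# Assumes `source_xml` (the serialized document, as str) and `filename` are defined elsewhere.
# ElementTree offers no clean way to specify a <!DOCTYPE ...> stanza, so we
# replace the existing <?xml ... ?> stanza with a full <?xml ... ?> + <!DOCTYPE ...> header.
new_header = '<?xml version="1.0" encoding="UTF-8" ?>\n' \
             '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">\n'

# Raw string avoids invalid escapes; count=1 touches only the first match.
target_xml = re.sub(r"<\?xml .+?\?>", new_header, source_xml, count=1)
# Text mode with an explicit encoding, so the str is written directly (Python 3).
with open(filename, 'w', encoding='utf-8') as catalog_file:
    catalog_file.write(target_xml)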
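
For anyone who wants to try the substitution end to end, here is a small self-contained sketch. It assumes Python 3.8 or later (where ET.tostring accepts xml_declaration), and the names replace_declaration and NEW_HEADER and the sample topic element are invented for this example. The pattern is anchored at the start of the text and also consumes the whitespace after the declaration, so the new header does not leave a blank line behind.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical replacement header: declaration plus DOCTYPE, one per line.
NEW_HEADER = ('<?xml version="1.0" encoding="UTF-8"?>\n'
              '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">\n')


def replace_declaration(xml_text, header=NEW_HEADER):
    """Swap a leading <?xml ... ?> declaration for the given header."""
    # A function replacement keeps re.sub from interpreting backslashes in header.
    return re.sub(r"\A<\?xml .+?\?>\s*", lambda match: header, xml_text, count=1)


# Sample document, invented for this sketch.
root = ET.Element("topic", {"id": "example"})
ET.SubElement(root, "title").text = "Example"

# xml_declaration=True guarantees there is a declaration to replace.
source_xml = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
target_xml = replace_declaration(source_xml)
print(target_xml)
```

If the input has no XML declaration at all, the pattern does not match and the text comes back unchanged, without a DOCTYPE, so it is worth making sure the serializer emits one first.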
rummidge
  • Could you elaborate on the "non-valid XML" problem? – posfan12 Jul 04 '19 at 12:58
  • @posfan12, I'll guess that the main issue would have been not having the DTD at the beginning of the line, which is easy to fix in demalexx's answer. – Luis Oct 02 '20 at 17:23