I'm trying to save an XML file encoded as UTF-16 with cElementTree. This is the same project as, but a different problem from, the DOCTYPE issue in: How to create <!DOCTYPE> with Python's cElementTree

I've learned that if I do not declare the encoding in the string, cElementTree will add it. So, the code is like this:

import xml.etree.cElementTree as ElementTree
from StringIO import StringIO

# Source document without an encoding declaration
s = '<?xml version="1.0" ?><!DOCTYPE tmx SYSTEM "tmx14a.dtd" ><tmx version="1.4a" />'
# Parse the string and keep the root <tmx> element
tree = ElementTree.parse(StringIO(s)).getroot()
header = ElementTree.SubElement(tree, 'header', {'adminlang': 'EN'})
body = ElementTree.SubElement(tree, 'body')
# Write the tree to disk as UTF-16
ElementTree.ElementTree(tree).write('myfile.tmx', 'UTF-16')

When I write the file with UTF-8, everything's great. However, when I change to UTF-16, the text encoding is corrupted. The file is also missing the required byte order mark (BOM). When I try adding the BOM to the start of the string,

s = '\xFF\xFE<?xml version="1.0"......

ElementTree reports the error "not well-formed (invalid token) line 1, column 1".

All the buffers are unicode data. How can I save to a UTF-16 XML file?

tahoar

1 Answer

resultstring = ElementTree.tostring(tree, encoding='utf-16')
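For a fuller picture, here is a minimal sketch of that call used end to end. It assumes Python 3, where cElementTree no longer exists and `xml.etree.ElementTree` takes its place; the helper names `to_utf16_bytes` and `save_utf16` are hypothetical, and the DOCTYPE from the question is left out. On Python 3 the returned bytes should begin with a BOM in the platform's native byte order, so the file is opened in binary mode and the bytes are written unchanged.

```python
import codecs
import xml.etree.ElementTree as ET


def to_utf16_bytes(root):
    """Serialize an element to UTF-16 bytes (hypothetical helper)."""
    return ET.tostring(root, encoding='utf-16')


def save_utf16(root, path):
    """Write the serialized bytes as-is; binary mode avoids re-encoding."""
    with open(path, 'wb') as f:
        f.write(to_utf16_bytes(root))


# Build the same small tree as in the question (without the DOCTYPE).
root = ET.fromstring('<tmx version="1.4a" />')
ET.SubElement(root, 'header', {'adminlang': 'EN'})
ET.SubElement(root, 'body')

data = to_utf16_bytes(root)
# codecs.BOM_UTF16 is the BOM for the platform's native byte order.
has_bom = data.startswith(codecs.BOM_UTF16)
```

Calling `save_utf16(root, 'myfile.tmx')` would then produce the file from the question. Checking `has_bom` before writing is a cheap way to confirm that the output carries the byte order mark.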

P.S. Since the lxml library duplicates the interface of the ElementTree module, it is a good idea to import ElementTree as etree. This reduces the changes required if you later need lxml's more powerful functionality.
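One common way to apply this advice is a fallback import, sketched here under the assumption of Python 3. It uses lxml when it is installed and the standard-library module otherwise, so the rest of the code only ever refers to `etree`. The tree built below is purely illustrative.

```python
try:
    from lxml import etree  # third-party; used when it is installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library fallback

# From here on the code is the same for either implementation.
root = etree.Element('tmx', {'version': '1.4a'})
etree.SubElement(root, 'body')
result = etree.tostring(root, encoding='utf-16')
```

The two libraries may differ in small serialization details, so it is safer to compare parsed trees than raw output bytes.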

newtover