
I'm pulling in data from a database and attempting to create an XML file from this data. The data is in UTF-8 and can contain characters such as á, š, or č. This is the code:

import xml.etree.cElementTree as ET

tree = ET.parse(metadata_file)
# ..some commands that alter the XML..
tree.write(metadata_file, encoding="UTF-8")

When writing the data, the script fails with:

Traceback (most recent call last):
  File "get-data.py", line 306, in <module>
    main()
  File "get-data.py", line 303, in main
    tree.write(metadata_file, encoding="UTF-8")
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 32: ordinal not in range(128)

The only way I have found to prevent this is to decode the data before it is written to the XML file:

text = text.decode('utf-8')

but then the resulting file will contain e.g. &#269; rather than a č. Any idea how I can write the data to the file and keep it in UTF-8?

Edit:

This is an example of what the script does:

]$ echo "<data></data>" > test.xml
]$ cat test.xml
<data></data>
]$ python
Python 2.7.5 (default, Nov  3 2014, 14:33:39)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.cElementTree as ET
>>> tree = ET.parse('./test.xml')
>>> root = tree.getroot()
>>> new = ET.Element("elem")
>>> new.text = "á, š, or č"
>>> root.append(new)
>>> tree.write('./text.xml', encoding="UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
mart1n
  • Possible duplicate: [Write xml utf-8 file with utf-8 data with ElementTree](http://stackoverflow.com/questions/10046755/write-xml-utf-8-file-with-utf-8-data-with-elementtree), [ElementTree and unicode](http://stackoverflow.com/questions/12349728/elementtree-and-unicode) – wwii Nov 28 '14 at 16:38

2 Answers


Ah, finally got it, this is the correct way to do it:

]$ echo "<data></data>" > test.xml
]$ python
Python 2.7.5 (default, Nov  3 2014, 14:26:24)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.cElementTree as ET
>>>
>>> tree = ET.parse('./test.xml')
>>> root = tree.getroot()
>>> new = ET.Element("elem")
>>> new.text = "á, š, or č".decode('utf-8')
>>> root.append(new)
>>> tree.write('./textout.xml', encoding="UTF-8")
>>>
>>> exit()
]$ cat textout.xml
<?xml version='1.0' encoding='UTF-8'?>
<data><elem>á, š, or č</elem></data>

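In my original code, I passed encoding="UTF-8" to write() but never decoded the text with .decode('utf-8'). In Python 2 a plain string literal is a byte string, so when ElementTree calls .encode() on it, Python first tries an implicit ASCII decode, which fails on the byte 0xc3. Assigning a unicode object to the element text avoids that implicit step.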
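A minimal sketch of the same fix outside the interactive session, assuming the database hands back UTF-8 byte strings. The `raw` value is a made-up sample, not data from the original script, and the in-memory buffer stands in for the output file. The bytes are decoded once on the way in, and ElementTree does the encoding on the way out:

```python
# -*- coding: utf-8 -*-
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: bytes as they might arrive from a UTF-8 database.
raw = u"á, š, or č".encode("utf-8")

root = ET.fromstring("<data></data>")
elem = ET.SubElement(root, "elem")
# Decode to a text (unicode) object before handing it to ElementTree.
elem.text = raw.decode("utf-8")

# Serialize into a bytes buffer; ElementTree encodes the text as UTF-8.
buf = io.BytesIO()
ET.ElementTree(root).write(buf, encoding="UTF-8")
output = buf.getvalue()
```

Here `output` holds the raw UTF-8 bytes for the accented characters rather than numeric character references. The sketch is written so that it should run unchanged on Python 2.7 and Python 3.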

mart1n

The question does not make clear what kind of object metadata_file is.

If an ordinary file object is used, there are no errors, and the output is as expected:

>>> import xml.etree.cElementTree as ET
>>> stream = open('test.xml', 'wb+')
>>> stream.write(u"""\
... <root>characters such as á, š, or č.</root>
... """.encode('utf-8'))
>>> stream.seek(0)
>>> tree = ET.parse(stream)
>>> stream.close()
>>> ET.tostring(tree.getroot())
'<root>characters such as &#225;, &#353;, or &#269;.</root>'
>>> stream = open('test.xml', 'w')
>>> tree.write(stream, encoding='utf-8', xml_declaration=True)
>>> stream.close()
>>> open('test.xml').read()
"<?xml version='1.0' encoding='utf-8'?>\n<root>characters such as \xc3\xa1, \xc5\xa1, or \xc4\x8d.</root>"
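The two outputs above differ only in how the characters are serialized: without an `encoding` argument, `ET.tostring()` falls back to US-ASCII and emits numeric character references, while an explicit UTF-8 encoding emits the raw bytes. A small sketch of that contrast, assuming the same sample text and written so that it should run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

source = u"<root>characters such as á, š, or č.</root>".encode("utf-8")
root = ET.fromstring(source)

# Default serialization is US-ASCII, so non-ASCII text becomes &#NNN; references.
ascii_bytes = ET.tostring(root)

# Asking for UTF-8 keeps the characters as raw UTF-8 byte sequences.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# Both serializations parse back to the same text.
same_text = ET.fromstring(ascii_bytes).text == ET.fromstring(utf8_bytes).text
```

Both forms are equivalent XML, so the character references are not wrong, only less readable.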
ekhumoro
  • It's a regular file: `metadata_file="./metadata.xml"`. I don't open the file like you do, though; I only get the tree as pasted in the question, modify it, and then write it out. – mart1n Nov 28 '14 at 19:37
  • @mart1n. In that case, you have not shown the actual code that produces the errors. – ekhumoro Nov 28 '14 at 19:55
  • Ok, I replicated what my script does here: http://pastebin.com/v8sHDDgv It still throws the traceback despite encoding it in UTF-8. – mart1n Dec 01 '14 at 09:34
  • @mart1n: The pastebin code should be in the question; don't "hide" it in a comment to an answer. – mzjn Dec 01 '14 at 10:00
  • @mzjn Fair enough, I added it to the original question. – mart1n Dec 01 '14 at 10:18