If the following string is parsed and serialized again with ElementTree (lxml behaves the same way), the umlauts are converted to entities.

import xml.etree.ElementTree as ET

root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")
as_text = ET.tostring(root).decode("utf-8")
print(as_text)

Output:

<r><s>Die H&#228;user haben D&#228;cher.</s></r>

Expected output:

<r><s>Die Häuser haben Dächer.</s></r>

The umlauts are just an example. I generally want to disable entity conversions and instead keep the raw input symbols.

Can I disable entity conversion? If not, is there a safe way to convert the entities back to the original characters?

mzjn

1 Answer

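The default encoding used by tostring() is ASCII in both ElementTree and lxml. Characters that cannot be represented in ASCII, such as ä, are therefore written as numeric character references (&#228;).

To get the expected output, you can use encoding="unicode", which makes tostring() return a str with the characters left as they are: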

as_text = ET.tostring(root, encoding="unicode")
print(as_text)
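
For comparison, here is a minimal self-contained sketch that serializes the same element both ways. It assumes the standard-library ElementTree from the question; the variable name as_ascii is only for illustration. With the default encoding the result is a bytes object containing character references, while encoding="unicode" gives a str, so no byte encoding such as UTF-8 has been applied to it yet.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")

# Default encoding (ASCII): the result is bytes, and characters outside
# ASCII are replaced by numeric character references such as &#228;.
as_ascii = ET.tostring(root)

# encoding="unicode": the result is a str, and the characters are kept.
as_text = ET.tostring(root, encoding="unicode")

print(type(as_ascii).__name__, as_ascii)
print(type(as_text).__name__, as_text)
```

Because as_text is an ordinary str, the encoding only matters later, at the point where the string is written to a file or a terminal.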

mzjn
  • Which encoding will I get exactly? utf-? –  Feb 20 '21 at 00:03
  • What if I want to save the string to a file as utf-8 regardless of the terminal's encoding? E.g. if I am on the Windows commandline. –  Feb 22 '21 at 22:15
  • Specify the wanted encoding of the file: `ET.ElementTree(root).write("out.xml", encoding="UTF-8")`. – mzjn Feb 23 '21 at 04:18
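
The file-writing suggestion from the last comment could look roughly like the sketch below. The temporary directory and the out_path name are assumptions made to keep the example self-contained, and xml_declaration=True is passed explicitly so that the file states its own encoding. write() does the encoding itself, so the terminal's encoding should not affect the file contents.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring("<r><s>Die Häuser haben Dächer.</s></r>")

# Hypothetical output location; a temporary directory is used here
# only so that the example does not touch the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.xml")

# The file is written as UTF-8 regardless of the console's encoding.
ET.ElementTree(root).write(out_path, encoding="UTF-8", xml_declaration=True)

# Read the file back with an explicit encoding instead of relying on
# the platform default (which may differ on Windows).
with open(out_path, encoding="utf-8") as f:
    content = f.read()

print(content)
```

Note that print(content) can still fail or show replacement characters on a console that cannot display the umlauts; that is a limitation of the console and does not change what was written to the file.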