How to parse XML/XLIFF with HTML entities in Python 3.7

Question

I have the following example code:

example.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xliff = '''<?xml version="1.0"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" datatype="plaintext" original="ng2.template" target-language="de-DE">
    <body>
      <trans-unit id="ecb14a83c67a551ce9a04669d31465d977949484" datatype="html">
        <source>Something to translate</source>
        <target>Translations with&nbsp;entities &amp; stuff</target>
      </trans-unit>
    </body>
  </file>
</xliff>
'''

tree = ET.fromstring(xliff)
# same when using external file ET.parse(xliffPath)

When I run it in Python 3.7, I get this error:

Traceback (most recent call last):
  File ".\example.py", line 17, in <module>
    tree = ET.fromstring(xliff)
  File "PathToPython37\lib\xml\etree\ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: undefined entity: line 7, column 33

It complains about &nbsp; and other HTML entities.

The question is: how can I parse XML containing HTML entities with Python 3.7, preferably using xml.etree.ElementTree?

I have also checked https://stackoverflow.com/questions/35093594/how-do-i-parse-xml-that-contains-html-entities. Its answer is for an earlier version of Python, and the alternative answer suggests the XML is not well-formed, but I don't see why entities would not be valid content of translations. I also checked a few other related questions, but none of them gave me any hints on how to solve it. — Xesenix, Sep 06 '19 at 10:59
There are no HTML entities in XML. This is syntactically invalid XML - how is it created? — Tomalak, Sep 06 '19 at 11:34
An external tool. Also, those translations end up being displayed as part of HTML. Why does ElementTree try to do anything with those entities if it doesn't understand them? Can't it ignore them? — Xesenix, Sep 06 '19 at 11:45
Those entities are not valid in XML. They are undefined. No XML parser can read those files, this is not limited to ElementTree. — Tomalak, Sep 06 '19 at 11:47
This is why I asked how those files are being created, because no XML-aware tool (e.g. DOM API library) would be able to create this. (Unless there is a special DOCTYPE that defines those entities, but that does not seem to be the case looking at your sample.) — Tomalak, Sep 06 '19 at 11:49
Can I convert them to something else before using the parser? I tried to look up an xmlns that would contain some info about entities, but didn't manage to find anything. — Xesenix, Sep 06 '19 at 11:50
if all you need is to "get out the data" then you can try reading them with an HTML parser (the built-in ElementTree won't work, but lxml in HTML mode would). — Tomalak, Sep 06 '19 at 11:54
Thanks for the explanation, I will try that. Also, after this discussion I found that there is a set of predefined entities that should work with any XML, and I can try to convert the invalid ones to those valid ones. — Xesenix, Sep 06 '19 at 11:56
If you are thinking about search-and-replace before parsing those files... don't. If it's at all possible, use your time to change whatever generates those files into something that produces sane XML, instead of trying to fix the broken output of a broken tool. — Tomalak, Sep 06 '19 at 12:03
Changing a tool used by dozens of people in a few companies, or find and replace... hmm, I think I'll choose the latter :P — Xesenix, Sep 06 '19 at 12:10
If that were my tool, I wouldn't have to fix its output outside of its code. Also, you lost the context of this conversation: "external tool". — Xesenix, Sep 06 '19 at 12:26
Going further, I am the person who usually has to fix the output of that tool for everybody... — Xesenix, Sep 06 '19 at 12:32
Yeah, but it's still an option to make the ones who own the tool aware that they are producing garbage. Also, from a technical perspective, if dozens of people are actively using these files, then they already have workarounds, maybe they are using HTML parsers (or god forbid, regex). And when the input files change from containing `&nbsp;` to `&#160;` (etc), then exactly nothing breaks for anyone who is already dealing with those files. But it would make the use of XML parsers possible, which kind of is the intention when XML is being used as a data format. — Tomalak, Sep 06 '19 at 12:32
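
For reference, the pre-parse conversion discussed in the comments above could look roughly like the sketch below. It is not taken from any answer on this page. It assumes the only problem in the files is HTML named entities, and it rewrites each of them as a numeric character reference (which any XML parser accepts) while leaving the five predefined XML entities alone. The helper name `html_entities_to_numeric` is made up for illustration; `html.entities.name2codepoint` is the standard-library table of HTML 4 entity names.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that every XML parser already understands.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &nbsp; (not numeric ones like &#160;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text):
    """Hypothetical helper: turn HTML named entities into numeric references."""
    def replace(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])
    return _ENTITY_RE.sub(replace, text)


xliff = '''<?xml version="1.0"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" datatype="plaintext" original="ng2.template" target-language="de-DE">
    <body>
      <trans-unit id="ecb14a83c67a551ce9a04669d31465d977949484" datatype="html">
        <source>Something to translate</source>
        <target>Translations with&nbsp;entities &amp; stuff</target>
      </trans-unit>
    </body>
  </file>
</xliff>
'''

# Parse the cleaned-up text instead of the raw one.
root = ET.fromstring(html_entities_to_numeric(xliff))
ns = {"x": "urn:oasis:names:tc:xliff:document:1.2"}
target = root.find(".//x:target", ns)
print(repr(target.text))
```

Two caveats with this kind of text substitution: the regular expression does not know about CDATA sections or comments, so entity-like text inside them would be rewritten as well, and when the tree is written back out the non-breaking space will not be serialized as `&nbsp;` again.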
