XML Parse Error with invalid HTML code (Elementtree)

Question

When I parse the XML string below, taken from a larger XML file, I run into what I think is an invalid HTML character code, and the parser outputs the following error message.

The error message was: ParseError: reference to invalid character number

I deleted the rest of the description body and left only the part that caused the error. How do I get elementtree to ignore these invalid HTML character codes or process them in some way?

The code and XML excerpt are below:

XML: <dc:description> **(10&#410)** </dc:description>


import os
import html
import io
import sys
import xml.etree.ElementTree as ET

def process_file(file):
    parser = ET.XMLParser(encoding='utf-8')
    # ET.parse raises ParseError here if the file is not well-formed XML
    tree = ET.parse(file, parser=parser)

score 0 · Answer 1 · answered Jun 11 '20 at 03:16

How do I get elementtree to ignore these invalid HTML character codes or process them in some way?

You don't.

You're trying to apply an XML tool to non-XML data. It's properly refusing to cooperate.

The solution is to first fix your data to be XML before trying to process it as XML. Do this manually, or try to do it programmatically by processing the document at the character/string level.
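As a rough illustration of the programmatic route, here is a minimal sketch that repairs the text at the string level before handing it to ElementTree. It assumes the only defect is an ampersand that does not begin a well-formed reference (as in `&#410)` with its missing semicolon). The names `BARE_AMP`, `escape_bare_ampersands`, and `parse_lenient` are made up for this example, and the sample element drops the `dc:` prefix so the snippet does not need a namespace declaration.

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" that does NOT start a well-formed reference such as
# "&amp;", "&#410;" or "&#x19A;" (hypothetical helper pattern).
BARE_AMP = re.compile(
    r"&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][A-Za-z0-9_.-]*);)"
)


def escape_bare_ampersands(text):
    """Turn stray ampersands into "&amp;" so they become literal text."""
    return BARE_AMP.sub("&amp;", text)


def parse_lenient(xml_text):
    """Repair the string first, then parse it as ordinary XML."""
    return ET.fromstring(escape_bare_ampersands(xml_text))


sample = "<description> (10&#410) </description>"
root = parse_lenient(sample)
print(repr(root.text))
```

This keeps the broken sequence as literal text rather than guessing which character was meant. It is not a general repair tool: a well-formed reference to a character XML forbids (for example `&#0;`) or an undeclared named entity such as `&nbsp;` would still be rejected, and the regex would also rewrite ampersands inside CDATA sections, which is usually not what you want.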

See also How to parse invalid (bad / not well-formed) XML?
