How do I convert XML to HTML in python?

Question

I can't get xml.etree.ElementTree to print or acknowledge the correct XHTML header. It insists on giving a generic XML header, prefixing all tags with "html:", throwing exceptions, or a combination of those.

How do I create a valid XHTML document in the first place?

I've got about 4 megabytes of xml files, and I'm trying to create a valid epub from them. There's various munging that needs to be done, <chapter> tags have no place in xhtml, for instance.

the following code:

    import xml.etree.ElementTree as ET
    xhtml = ET.fromstring(                                                                          
    "<?xml version=\"1.0\" xmlns=\"http://www.w3.org/1999/xhtml\" ?>\n<head><title></title></head>\n<body>\n</body>")

throws:

xml.etree.ElementTree.ParseError: XML declaration not well-formed: line 1, column 31

If I instead give the "correct" xhtml header, it insists it's html, gives it's own xml header, and prefixes all tags with "html:"

If I give the "correct" xml header, then epubcheck complains about "" not being a valid namespace (which I suppose it isn't).

The theory is that if I could create (and subsequently write out) a valid xhtml document, I could parse my xml for the <body> and <title> that's needed, mung them appropriately (href and src's all need changed, for instance), stick them in there, and be golden.

According to what I've found, a valid xhtml document MUST start with <xhtml xmlns="http://www.w3.org/1999/xhtml> and contain a head (with required title element) and a body. I'm not certain what (if any) of that I can leave out and still pass epubcheck's requirements.

Surely there's a way to force ET to use the correct header? Or do I need to use a different library, or what?

score 1 · Answer 1 · answered Jul 28 '19 at 22:45

One way to achieve this is to use XSLT Transformation. Most programming languages including Python will have support to convert an XML document into another document (e.g. HTML) when supplied with an XSL.

A good tutorial on XSLT Transformation can be found here

Use of Python to achieve transformation (once an XSL is prepared) is described here

score 1 · Answer 2 · answered Jul 29 '19 at 08:07

There are several things wrong with your XHTML source. First, xmlns is not a correct attribute for the xml declaration; it should be put on the root element instead. And the root element for XHTML is <html>, not <xhtml>. So the valid XHTML input in this particular case would be

<?xml version=\"1.0\"?>\n<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head><title></title></head>\n<body>\n</body></html>

That said, I'm not sure if xml.etree.ElementTree accepts that, having no experience with it.

How do I convert XML to HTML in python?

2 Answers2