I'm using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped.
Here's a relevant example of the input HTML (most of the <head> cut for clarity):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<script type="text/JavaScript">
<!-- // Hide the JS
startTimeout(6000000, "/");
// -->
</script>
Here's the code:
// XOMSafeSAXParser is the Neko SAXParser extended to allow
// XOM to set the (unnecessary in this case) features
// external-general-entities and external-parameter-entities
XMLReader reader = new XOMSafeSAXParser();
Builder xomBuilder = new Builder(reader);
Reader input = ...; // file, resource, etc.
Document doc = xomBuilder.build(input);
Serializer s = new Serializer(System.out, "UTF-8");
s.setIndent(4);
s.setMaxLength(200);
s.write(doc);
s.flush();
Here's the corresponding output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML lang="en">
<HEAD>
<SCRIPT type="text/JavaScript"> <!-- // Hide the JS startTimeout(6000000, "/"); // --> </SCRIPT>
</HEAD>
When I extract the script element from the XOM document, it looks like it's already been mangled (the SCRIPT element has one Text node as a child, not the sequence of Texts and Comments I would expect), so I don't think it's the Serializer that's going wrong.
Now, I don't expect the line breaks to be preserved, and in fact I'm going to throw the script tags out anyway. But there are other places where I'd like comments to be preserved, or at minimum I'd like to be able to get the text without escaped comments embedded in it.
Any ideas?
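For the "at minimum" case, one possible fallback is to post-process the text after extraction. The following is only a sketch under assumptions: the class and method names are hypothetical, it uses plain JDK regexes rather than anything from XOM, NekoHTML, or JTidy, and it assumes the script text still contains its original line breaks and uses the old-style `<!--` ... `// -->` hiding wrapper shown above.

```java
public class ScriptCommentStripper {

    /**
     * Removes a legacy "hide the JS" wrapper from script text.
     * Hypothetical helper: assumes the opener "<!--" sits on its own line
     * and the closer is "-->" optionally preceded by "//".
     */
    static String stripHidingComment(String scriptText) {
        // Drop leading whitespace, the "<!--" opener, and the rest of that line.
        String withoutOpener = scriptText.replaceFirst("\\A\\s*<!--[^\\n]*\\n?", "");
        // Drop a trailing "-->" (optionally written as "// -->") and surrounding whitespace.
        String withoutCloser = withoutOpener.replaceFirst("\\s*(?://)?\\s*-->\\s*\\z", "");
        return withoutCloser.trim();
    }

    public static void main(String[] args) {
        String raw = "\n<!-- // Hide the JS\nstartTimeout(6000000, \"/\");\n// -->\n";
        System.out.println(stripHidingComment(raw));
    }
}
```

Text that has no wrapper is returned unchanged apart from trimming. If the line breaks have already been collapsed, the opener pattern would consume the whole line, so this would need to run on the Text node's value rather than on serialized output.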
Update: NekoHTML was mangling some tags, so I switched to JTidy, and I have the same problem. Interestingly, though, it's only a problem for the script tag in the header; other comments come through fine. And there are weird extra JavaScript comments that I suspect (hope and pray) are JTidy's fault.
<script type="text/JavaScript"> // <!-- // Hide the JS startTimeout(6000000, "/"); // --> // </script>
It looks as though what JTidy's doing is converting <script> contents to CDATA; when I send JTidy's raw output to stdout, I get this:
<script type="text/JavaScript">
//<![CDATA[
<!-- // Hide the JS
startTimeout(6000000, "/");
// -->
//]]>
</script>
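The CDATA wrapping would explain the single Text child: inside a CDATA section, `<!--` is ordinary character data, so an XML parser never reports a comment there. Here is a minimal check of that behaviour. It is a sketch that assumes the JDK's built-in DOM parser instead of XOM, and the class name, helper methods, and sample string are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CdataCommentCheck {

    // Hypothetical sample: JTidy-style CDATA-wrapped script plus one ordinary comment in the body.
    static final String SAMPLE =
        "<html><head><script type=\"text/JavaScript\">\n"
        + "//<![CDATA[\n"
        + "<!-- // Hide the JS\n"
        + "startTimeout(6000000, \"/\");\n"
        + "// -->\n"
        + "//]]>\n"
        + "</script></head><body><!-- a real comment --><p>text</p></body></html>";

    // Returns the first element with the given tag name.
    static Node firstElement(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).item(0);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    // Counts direct children of the element that the parser reported as Comment nodes.
    static int countCommentChildren(String xml, String tag) {
        NodeList children = firstElement(xml, tag).getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    // Returns the concatenated character data of the element (comments are not included).
    static String textOf(String xml, String tag) {
        return firstElement(xml, tag).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("comments in script: " + countCommentChildren(SAMPLE, "script"));
        System.out.println("comments in body:   " + countCommentChildren(SAMPLE, "body"));
        System.out.println("script text:" + textOf(SAMPLE, "script"));
    }
}
```

With this input, the script element should report no Comment children while its text still contains the literal `<!--`, and the body should report one Comment. That matches the symptom above, where only the script content comes through with the comment markup embedded as text.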