
I'm trying to write a SAX parser for an XHTML document that I download from the web. At first I was having a problem with the doctype declaration (I found out from here that it was because W3C have intentionally blocked access to the DTD), but I fixed that with:

    XMLReader reader = parser.getXMLReader();
    reader.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

However, now I'm experiencing a second problem. The SAX parser throws an exception when it reaches some Javascript embedded in the XHTML document:

    <script type="text/javascript" language="JavaScript">
    function checkForm() {
        answer = true;
        if (siw && siw.selectingSomething)
            answer = false;
        return answer;
    }//
    </script>

Specifically the parser throws an error once it reaches the &&'s, as it's expecting an entity reference. The exact exception is:

`org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:391)
at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1390)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1814)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3000)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:624)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:486)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:810)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:740)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:110)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1208)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:525)
at MLIAParser.readPage(MLIAParser.java:55)
at MLIAParser.main(MLIAParser.java:75)`

I suspect (but I don't know) that if I hadn't disabled the DTD then I wouldn't get this error. So, how can I avoid the DTD error and avoid the entity reference error?

Cheers,

Pete

  • Instead of disabling the DTD, I downloaded it, and added it into my software as an embedded resource; and so, then, when the parser wants it, I give it my local/downloaded/cached copy of the DTD, instead of getting it from the internet. This is better I think than completely disabling the DTD processing. – ChrisW Aug 16 '09 at 13:27
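
A minimal sketch of the local-DTD approach from the comment above, assuming the JDK's built-in SAX parser. The class `LocalDtdDemo`, the helper `textOf`, and the one-line `STUB_DTD` are hypothetical stand-ins; a real application would hand back the full DTD file bundled as a resource rather than a stub.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocalDtdDemo {
    // Assumption: stands in for a bundled copy of the XHTML DTD.
    // It declares a single entity so the example stays short.
    static final String STUB_DTD = "<!ENTITY nbsp \"&#160;\">";

    static final String XHTML_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN";

    // Serves the XHTML DTD from memory instead of fetching it from w3.org.
    static final EntityResolver LOCAL_DTD = (publicId, systemId) -> {
        if (XHTML_PUBLIC_ID.equals(publicId)) {
            return new InputSource(new StringReader(STUB_DTD));
        }
        return null; // anything else falls back to the parser's default lookup
    };

    // Parses the page with the local resolver and returns all character data.
    static String textOf(String xhtml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(LOCAL_DTD);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><body><p>a&nbsp;b</p></body></html>";
        // &nbsp; resolves because the entity comes from the local DTD copy.
        System.out.println(textOf(page).length());
    }
}
```

Because the DOCTYPE is still processed, named entities such as `&nbsp;` keep working, which is the practical advantage over switching DTD handling off.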

3 Answers

The (X)HTML you are trying to parse is not well-formed XML (otherwise you wouldn't be getting a SAX parsing error), and the bare double ampersand ("&&") confirms that: in XML, a literal "&" must be written as "&amp;" or placed inside a CDATA section. That means you can't use a plain XML parser on the document as it stands.
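
To illustrate the point, here is a small sketch using only the JDK's SAX parser. The class `AmpersandCheck` and the method `isWellFormed` are made-up names for this example; it simply parses a raw and an escaped version of the script condition and reports whether each one is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class AmpersandCheck {
    // Returns true if the parser accepts the text, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false; // e.g. "The entity name must immediately follow the '&'..."
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parser could not be set up or run", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<script>if (siw && siw.selectingSomething) answer = false;</script>";
        String escaped = "<script>if (siw &amp;&amp; siw.selectingSomething) answer = false;</script>";
        System.out.println("raw:     " + isWellFormed(raw));
        System.out.println("escaped: " + isWellFormed(escaped));
    }
}
```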

Tools such as TagSoup can help here: TagSoup acts as a SAX parser for poorly formed, real-world HTML and maps it to proper SAX/XML events, so you can keep the same SAX parsing code as before.

Adam Batkin

I think you're supposed to put the script content in a CDATA section. For example, http://www.w3schools.com/TAGS/tag_script.asp gives the following:

<script type="text/javascript"><![CDATA[
document.write("Hello World!")
//]]></script>
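
As a rough check of this, assuming the JDK's built-in SAX parser, the sketch below parses a CDATA-wrapped version of the script from the question and collects its text. `CdataScriptDemo` and `scriptText` are hypothetical names used only for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CdataScriptDemo {
    // Parses the document and returns the character data the handler receives.
    static String scriptText(String xml) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            text.append(ch, start, length);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String doc = "<script type=\"text/javascript\"><![CDATA[\n"
                + "if (siw && siw.selectingSomething) answer = false;\n"
                + "//]]></script>";
        // Inside CDATA the "&&" is plain text, so the parser passes it through.
        System.out.println(scriptText(doc));
    }
}
```

This only helps if you control the page's markup; a document downloaded from someone else's site would still arrive without the CDATA wrapper.
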
ChrisW
  • Some additional info on this: xhtml is commonly served as mimetype text/html instead of application/xhtml+xml like it should, which is why bugs like these are possible. Also see http://www.w3.org/TR/2002/NOTE-xhtml-media-types-20020801/ – wds Aug 17 '09 at 10:00

NekoHTML will probably fix this for you as well; you use it as the XMLReader.

If you're using a SAX filter, you might also be able to insert CDATA events after you encounter a startElement for <script>, although that might be parser-dependent as not all parsers support the LexicalHandler features.
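
The parser-dependence can be probed with a sketch like the one below, assuming the JDK's built-in parser and the standard SAX2 `lexical-handler` property. `CdataEventsDemo` and `cdataEvents` are made-up names. Note that it only shows how CDATA boundaries are reported for input the parser accepts; it is not a repair for a page the parser has already rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class CdataEventsDemo {
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    // Returns the CDATA boundary events reported while parsing the document.
    static List<String> cdataEvents(String xml) {
        List<String> events = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void startCDATA() { events.add("startCDATA"); }
            @Override public void endCDATA() { events.add("endCDATA"); }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            try {
                reader.setProperty(LEXICAL_HANDLER, handler);
            } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
                // This is the parser-dependent part: not every XMLReader offers it.
                throw new UnsupportedOperationException("no LexicalHandler support", e);
            }
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(cdataEvents("<script><![CDATA[a && b]]></script>"));
        System.out.println(cdataEvents("<script>a</script>"));
    }
}
```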

thom_nic