DTD download error while parsing XHTML document in XOM

Question

I am trying to parse an XHTML document whose doctype declares the transitional DTD as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

When I do Builder.build on the document, I get the following exception:

  java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1305)
       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
       at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
       at nu.xom.Builder.build(Builder.java:1127)
       at nu.xom.Builder.build(Builder.java:1019)

If I remove the doctype declaration, it parses just fine. I can successfully download the DTD from my browser, which tells me that the URL is valid. I don't want to remove the doctype declaration. Is there a way to tell the builder not to download the DTD, or to provide it with an alternate DTD?

Are you parsing html from 'the wild' or did you create/have control over the pages you're parsing? — lucas, Jun 15 '09 at 21:10
I have control over the html I am parsing, so at the very least I can remove the doctype declaration. But I am trying to stick to good practices and retain the doctype declaration. — Bala, Jun 16 '09 at 00:21

score 7 · Answer 1 · answered Jan 26 '10 at 14:03

This solves the problem:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
Document document = factory.newDocumentBuilder().parse(is);

score 4 · Accepted Answer · answered Jun 15 '09 at 22:32

Taking a quick look at the javadoc for Builder, I guess you could provide an EntityResolver via the constructor that takes an XMLReader. I would avoid letting the parser download files from the internet where possible.
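One way to follow this suggestion is sketched below. It uses only the JDK's built-in SAX parser, and the names `OfflineReaderSketch` and `newOfflineReader` are illustrative, not part of XOM. The resolver answers every external entity request with empty content, so the parser never contacts www.w3.org. The trade-off is that named entities declared in the DTD, such as `&nbsp;`, are no longer declared, so documents that use them may fail to parse.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class OfflineReaderSketch {

    // Hypothetical helper: returns a SAX reader that never fetches external entities.
    static XMLReader newOfflineReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Answer every external entity request (including the DTD) with empty content.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return reader;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>hi</p></body></html>";
        XMLReader reader = newOfflineReader();
        reader.setContentHandler(new DefaultHandler());
        reader.parse(new InputSource(new StringReader(xhtml)));
        System.out.println("Parsed without requesting the DTD over HTTP");
    }
}
```

With XOM on the classpath, the reader would be handed over as `new Builder(OfflineReaderSketch.newOfflineReader())`, assuming Builder leaves the reader's entity resolver in place.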

answered Jun 15 '09 at 22:32

McDowell

org.apache.xerces.parsers.SAXParser xmlReader = new org.apache.xerces.parsers.SAXParser();
xmlReader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
Builder xomBuilder = new Builder(xmlReader);
– Bala Jun 16 '09 at 17:18

Why the 503s were happening: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic – Bala Jun 16 '09 at 17:23

Instead of disabling the DTD, I downloaded it and added it to my software as an embedded resource. When the parser asks for it, I supply my local cached copy instead of fetching it from the internet. I think this is better than disabling DTD processing entirely. – ChrisW Aug 16 '09 at 13:25
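The approach in this comment could look roughly like the following sketch. `LocalDtdResolver` and the resource path `/dtd/xhtml1-transitional.dtd` are hypothetical; only the SAX `EntityResolver` interface and the public identifier from the question are given. The XHTML 1.0 DTD also pulls in three entity files (`xhtml-lat1.ent`, `xhtml-symbol.ent`, `xhtml-special.ent`), which would need to be bundled and mapped the same way.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class LocalDtdResolver implements EntityResolver {

    // Public identifier -> classpath location of the bundled copy (paths are assumptions).
    private final Map<String, String> bundled = new HashMap<>();

    LocalDtdResolver() {
        bundled.put("-//W3C//DTD XHTML 1.0 Transitional//EN", "/dtd/xhtml1-transitional.dtd");
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        String resource = bundled.get(publicId);
        if (resource == null) {
            return null; // Unknown entity: let the parser use its default behaviour.
        }
        InputStream in = LocalDtdResolver.class.getResourceAsStream(resource);
        if (in == null) {
            throw new IOException("Bundled DTD not found on classpath: " + resource);
        }
        InputSource source = new InputSource(in);
        source.setPublicId(publicId);
        source.setSystemId(systemId); // Keep the original system ID for relative references.
        return source;
    }
}
```

It would be installed with `xmlReader.setEntityResolver(new LocalDtdResolver())` before the reader is passed to `new Builder(xmlReader)`.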
