I am trying to parse an HTML document whose doctype declares the XHTML 1.0 Transitional DTD, as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

When I call Builder.build on the document, I get the following exception:

  java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1305)
       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
       at org.apache.xerces.impl.XMLEntityManager.startDTDEntity(Unknown Source)
       at org.apache.xerces.impl.XMLDTDScannerImpl.setInputSource(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
       at nu.xom.Builder.build(Builder.java:1127)
       at nu.xom.Builder.build(Builder.java:1019)

If I remove the doctype declaration, it parses just fine. I can successfully download the DTD from my browser, which tells me that the URL is valid. I don't want to remove the doctype declaration. Is there a way to tell the builder not to download the DTD, or to provide it with an alternate DTD?

Bala
  • Are you parsing html from 'the wild' or did you create/have control over the pages you're parsing? – lucas Jun 15 '09 at 21:10
  • I have control over the html I am parsing, so at the very least I can remove the doctype declaration. But I am trying to stick to good practices and retain the doctype declaration. – Bala Jun 16 '09 at 00:21

2 Answers

This solves the problem:

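    // "is" is the InputStream of the document being parsed
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(false);
    // Xerces-specific feature: do not fetch the external DTD when not validating
    factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    Document document = factory.newDocumentBuilder().parse(is);
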
agori

Taking a quick look at the javadoc for Builder, I guess you could provide an EntityResolver via the constructor that takes an XMLReader. I would avoid letting the parser download files from the internet where possible.
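
To make the EntityResolver idea concrete, here is a minimal sketch using only the JDK's SAX classes. The class name LocalDtdResolver, its hits() counter and the newReader helper are hypothetical names chosen for illustration, and the map of system IDs to DTD text is an assumption: in real code you would more likely read a bundled copy of the DTD from the classpath.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper: serves DTDs from local copies instead of the network.
class LocalDtdResolver implements EntityResolver {
    // Assumption for this sketch: system ID (the DTD URL) -> DTD text held in memory.
    private final Map<String, String> localCopies;
    private int hits = 0;

    LocalDtdResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtd = localCopies.get(systemId);
        if (dtd == null) {
            // Unknown entity: returning null lets the parser fall back to its
            // default behaviour, which may still be a network fetch.
            return null;
        }
        hits++;
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    // Number of entities answered from the local copies.
    int hits() {
        return hits;
    }

    // Creates a namespace-aware XMLReader that consults the given resolver.
    static XMLReader newReader(EntityResolver resolver) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        return reader;
    }
}
```

The resulting XMLReader could then be passed to the Builder constructor that takes an XMLReader. I have not tested this with XOM itself, so treat it as a starting point. Note also that the real xhtml1-transitional.dtd pulls in three entity files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent), so a local setup would need copies of those as well.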

McDowell
  • org.apache.xerces.parsers.SAXParser xmlReader = new SAXParser(); xmlReader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false); Builder xomBuilder = new Builder(xmlReader); – Bala Jun 16 '09 at 17:18
  • Why the 503's were happening: http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic – Bala Jun 16 '09 at 17:23
  • Instead of disabling the DTD, I downloaded it, and added it into my software as an embedded resource; and so, then, when the parser wants it, I give it my local/downloaded/cached copy of the DTD, instead of getting it from the internet. This is better I think than completely disabling the DTD processing. – ChrisW Aug 16 '09 at 13:25