8

I'm using Java 6. I want to parse XHTML that I know is well-formed. As such, I don't want to do any validation against DTD's or other schemas referenced in the doc. However, I'm having trouble figuring out how to turn that validation off. I have

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(false);
    final DocumentBuilder b = factory.newDocumentBuilder();
    final InputSource s = new InputSource(new StringReader(str));
    org.w3c.dom.Document result = b.parse(s);

But I still get an exception on the last line ...

java.net.SocketException: Unexpected end of file from server
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:777)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:774)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:677)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEntityManager.java:1315)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startDTDEntity(XMLEntityManager.java:1282)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.setInputSource(XMLDTDScannerImpl.java:283)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(XMLDocumentScannerImpl.java:1194)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(XMLDocumentScannerImpl.java:1090)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1003)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:235)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at com.myco.myproj.util.XmlUtilities.getStringAsDocument(XmlUtilities.java:130)
    at com.myco.myproj.util.NetUtilities.getUrlAsDocument(NetUtilities.java:30)
    at com.myco.myproj.parsers.impl.AbstractChicagoReaderParser.parsePage(AbstractChicagoReaderParser.java:144)
    at com.myco.myproj.parsers.impl.AbstractChicagoReaderParser.getEvents(AbstractChicagoReaderParser.java:112)
    at com.myco.myproj.parsers.impl.ChicagoReaderParserTest.testParser(ChicagoReaderParserTest.java:29)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:76)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

I don't want my parser going to the Internet. How do I disable that? Thanks, - Dave

Edit: Per Traroth's suggestion, I tried the below code, but get the same exception

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(false);
    final DocumentBuilder builder = factory.newDocumentBuilder();
    builder.setEntityResolver(new EntityResolver() {
        @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                    return null;
            }
        });
    final InputSource s = new InputSource(new StringReader(str));
    org.w3c.dom.Document result = builder.parse(s);
Dave
  • 15,639
  • 133
  • 442
  • 830

2 Answers2

5

Here is how you create a DocumentBuilder that will ignore ALL external referenced entities, including DTDs:

final DocumentBuilder builder = factory.newDocumentBuilder();
builder.setEntityResolver(new EntityResolver() {
    @Override
        public InputSource resolveEntity(String publicId, String systemId) {
                // it might be a good idea to insert a trace logging here that you are ignoring publicId/systemId
                return new InputSource(new StringReader("")); // Returns a valid dummy source
        }
    });
Morten
  • 311
  • 4
  • 8
3

Probably related to an EntityResolver issue. You can take a look here:

How to read well formed XML in Java, but skip the schema?

Community
  • 1
  • 1
Alexis Dufrenoy
  • 11,784
  • 12
  • 82
  • 124
  • I assume you're talking about the accepted answer in that thread. I tried the suggestion (my code is included in the edit), but I still get the same exception. Am I missing something? – Dave Mar 28 '12 at 17:00
  • 1
    @Dave - Yes. Returning null is the same as not applying an entity resolver. You need to follow the other path of returning an Input source. Just change "foo.dtd" to the system id of the DTD for your XML file. – Alohci Mar 28 '12 at 18:20
  • Yes, if you look at the sample code in the answer, you will see the resolveEntity() method returns an actual non-null InputSource if the XML refers to a DTD, and null else. – Alexis Dufrenoy Mar 29 '12 at 11:40
  • I understand what you're saying with the example. My question, though, is that I don't want any validation/DTD parsing. I just want a bunch of elements. Can I get the parser to ignore any DTD references? – Dave Mar 29 '12 at 14:42