
I am trying to parse 11384 XML files into one SQLite database. Here is one of them:

<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (C) 2009/2010/2011 Ulrich Apel.
This work is distributed under the conditions of the Creative Commons
Attribution-Share Alike 3.0 Licence. This means you are free:
* to Share - to copy, distribute and transmit the work
* to Remix - to adapt the work

Under the following conditions:
* Attribution. You must attribute the work by stating your use of KanjiVG in
  your own copyright header and linking to KanjiVG's website
  (http://kanjivg.tagaini.net)
* Share Alike. If you alter, transform, or build upon this work, you may
  distribute the resulting work only under the same or similar license to this
  one.

See http://creativecommons.org/licenses/by-sa/3.0/ for more details.
-->
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd" [
<!ATTLIST g
xmlns:kvg CDATA #FIXED "http://kanjivg.tagaini.net"
kvg:element CDATA #IMPLIED
kvg:variant CDATA #IMPLIED
kvg:partial CDATA #IMPLIED
kvg:original CDATA #IMPLIED
kvg:part CDATA #IMPLIED
kvg:number CDATA #IMPLIED
kvg:tradForm CDATA #IMPLIED
kvg:radicalForm CDATA #IMPLIED
kvg:position CDATA #IMPLIED
kvg:radical CDATA #IMPLIED
kvg:phon CDATA #IMPLIED >
<!ATTLIST path
xmlns:kvg CDATA #FIXED "http://kanjivg.tagaini.net"
kvg:type CDATA #IMPLIED >
]>
<svg xmlns="http://www.w3.org/2000/svg" width="109" height="109" viewBox="0 0 109 109">
<g id="kvg:StrokePaths_0ff01" style="fill:none;stroke:#000000;stroke-width:3;stroke-linecap:round;stroke-linejoin:round;">
<g id="kvg:0ff01">
    <path id="kvg:0ff01-s1" d="M54.5,15.79c0,6.07-0.29,55.49-0.29,60.55"/>
    <path id="kvg:0ff01-s2" d="M54.5,88 c -0.83,0 -1.5,0.67 -1.5,1.5 0,0.83 0.67,1.5 1.5,1.5 0.83,0 1.5,-0.67 1.5,-1.5 0,-0.83 -0.67,-1.5 -1.5,-1.5"/>
</g>
</g>
<g id="kvg:StrokeNumbers_0ff01" style="font-size:8;fill:#808080">
    <text transform="matrix(1 0 0 1 45 16)">1</text>
    <text transform="matrix(1 0 0 1 45 88)">2</text>
</g>
</svg>

I'm using a SAX parser:

public class SaxKanjivgHandler extends DefaultHandler {
.....
        File folder = new File(KANJIVG_DIRECTORY);
        if (folder.isDirectory()) {
            File[] listOfFiles = folder.listFiles();

            for (File file : listOfFiles) {
                if (file.isFile()) {
                    currentFileName = file.getName();
                    readXmlFromFile(file);
                }
            }
        }
.....
    public void readXmlFromFile(File file) throws ParserConfigurationException,
            SAXException, IOException {

        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        parser.parse(file, this);

    }

When I am parsing files, I am getting this error:

Exception in thread "main" java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(Unknown Source)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at sun.net.www.MeteredStream.read(Unknown Source)
    at java.io.FilterInputStream.read(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager$RewindableInputStream.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipSpaces(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanEntityDecl(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDTDScannerImpl.scanDTDExternalSubset(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.dispatch(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$DTDDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(Unknown Source)
    at SaxKanjivgHandler.readXmlFromFile(SaxKanjivgHandler.java:63)
    at SaxKanjivgHandler.<init>(SaxKanjivgHandler.java:44)
    at Main.main(Main.java:28)

At first I thought one specific file was causing this error, but it occurs with different files at different times. The stack trace shows the parser fetching the external DTD over HTTP. How can I make the SAX parser stop connecting to the Internet?

Joe Rakhimov
  • I'm just guessing, but it's probably that you haven't [turned off DTD validation](https://stackoverflow.com/questions/1185519/how-to-read-well-formed-xml-in-java-but-skip-the-schema) – hd1 Feb 11 '15 at 17:46
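
One way to follow the suggestion in the comment above is to tell the parser not to load the external DTD at all. The sketch below assumes a Xerces-based parser, which the `com.sun.org.apache.xerces` frames in the stack trace suggest is in use; the feature URI is Xerces-specific and other parsers may reject it. The class name `OfflineSaxParsers` is hypothetical.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of the original code.
final class OfflineSaxParsers {
    // Xerces-specific feature: when false, a non-validating parser
    // ignores the external DTD subset instead of downloading it.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static SAXParser newParser() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        // Throws if the underlying parser does not recognize the feature.
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newSAXParser();
    }
}
```

With this, `readXmlFromFile` could call `OfflineSaxParsers.newParser().parse(file, this)`. The internal DTD subset (the `<!ATTLIST ...>` declarations inside the DOCTYPE) is still processed.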

1 Answer

You can supply your own EntityResolver:

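import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DummyEntityResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {
        // Return an empty entity for every request, so nothing is fetched.
        return new InputSource(new StringReader(""));
    }
}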

and

public void readXmlFromFile(File file) throws ParserConfigurationException,
        SAXException, IOException {

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    parser.getXMLReader().setEntityResolver(new DummyEntityResolver());
    parser.parse(file, this);

}

This stops external entity resolution. If there are external entities you do want to provide, you can do so by checking publicID and systemID.
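
As a rough sketch of the "checking publicID and systemID" idea: the resolver below serves a local copy of the SVG 1.0 DTD when the public ID matches, and an empty entity otherwise. The class name `LocalDtdResolver` and the location of the local DTD copy are assumptions for illustration; the public ID is the one from the DOCTYPE in the question.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver, not part of the original answer.
final class LocalDtdResolver implements EntityResolver {
    // Public ID from the DOCTYPE declaration in the KanjiVG files.
    static final String SVG10_PUBLIC_ID = "-//W3C//DTD SVG 1.0//EN";

    // Assumed path to a local copy of svg10.dtd.
    private final Path localSvgDtd;

    LocalDtdResolver(Path localSvgDtd) {
        this.localSvgDtd = localSvgDtd;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (SVG10_PUBLIC_ID.equals(publicId) && Files.isReadable(localSvgDtd)) {
            // Point the parser at the local file instead of www.w3.org.
            InputSource source = new InputSource(localSvgDtd.toUri().toString());
            source.setPublicId(publicId);
            return source;
        }
        // Everything else resolves to an empty entity, so no network access occurs.
        return new InputSource(new StringReader(""));
    }
}
```

Note that `SAXParser.parse(file, handler)` installs the handler itself as the entity resolver, so a resolver like this should be registered on an `XMLReader` that is then used to parse directly.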

HTH.

mp911de
  • Also you might like to note that Saxon has an EntityResolver, net.sf.saxon.lib.StandardEntityResolver, which knows about the most common W3C DTDs and external entity files, and redirects them to local copies held within the Saxon JAR file. – Michael Kay Feb 11 '15 at 22:01
  • The above solution does not work with the parser. Use XMLReader instead: XMLReader reader = XMLReaderFactory.createXMLReader(); reader.setEntityResolver(new DtdResolver()); – user001 Sep 01 '15 at 09:55
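
The `XMLReader` route from the last comment could look roughly like the sketch below. It calls `reader.parse(...)` directly, because `SAXParser.parse(File, DefaultHandler)` re-registers the handler as the entity resolver and so replaces any resolver set beforehand. It obtains the reader from `SAXParser.getXMLReader()` rather than `XMLReaderFactory`, which has been deprecated since Java 9. The class name `OfflineXmlReader` is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original answer.
final class OfflineXmlReader {
    static void parse(File file, DefaultHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        // Resolve every external entity (including the SVG DTD) to empty input.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        // Parse through the reader so the resolver set above stays in effect.
        reader.parse(new InputSource(file.toURI().toString()));
    }
}
```

In the question's code, `readXmlFromFile` would then reduce to `OfflineXmlReader.parse(file, this)`. Since the handler already extends `DefaultHandler`, overriding its `resolveEntity` method to return an empty `InputSource` should have a similar effect with the original `parser.parse(file, this)` call.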