My program uses javax.xml.stream.XMLStreamReader to do StAX parsing of a very large XML file from Wiktionary (around 4 GB).

It works fine for a very long sequence of tags and content, then it throws a very strange exception:

java.lang.ArrayIndexOutOfBoundsException: 8192
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:546)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(XMLEntityScanner.java:1629)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(XMLEntityScanner.java:1667)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1755)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2965)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:558)  
    at XmlParser.getAllTitles(XmlParser.java:36)
    at Main.main(Main.java:9)

The tag on which it throws this exception seems perfectly normal (</username>), so I can't understand why it fails.

I read on the internet that someone else had gone through this too, and it seems that to fix it I must update the Xerces version. My current Xerces version is Xerces-J 2.7.1.

I use this version of Java:

java version "1.7.0_51"
OpenJDK Runtime Environment (IcedTea 2.4.4) (7u51-2.4.4-0ubuntu0.13.10.1)
OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)

What I need is either to make the 2.7 version of Xerces work somehow, or to update the version used by OpenJDK.

I searched thoroughly for a solution but didn't find anything, so I don't know how to do either.

Paolo Dragone

1 Answer

I don't believe the most recent version of Xerces (2.11) includes an implementation of the JAXP XMLStreamReader for you to switch to.

For processing large XML files I would suggest switching to a SAX parser. It is more work for you, but it should process a large XML file with the smallest memory footprint. Switching from com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl to org.apache.xerces.jaxp.SAXParserImpl with JRE 1.7+ should only require adding xercesImpl.jar and xml-apis.jar from the latest Xerces-J release to the class path. You can see which implementation you have with:

// Needs javax.xml.parsers.SAXParser and javax.xml.parsers.SAXParserFactory.
SAXParserFactory parserFactory = SAXParserFactory.newInstance();
SAXParser parser = parserFactory.newSAXParser();
System.out.println("Parser class: " + parser.getClass().getName());
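
To make the SAX suggestion a bit more concrete, here is a minimal sketch of a handler that collects the text of every <title> element. I'm assuming the dump stores titles in plain, un-prefixed <title> elements and that your getAllTitles does roughly this. The class name TitleCollector and the parseTitles helper are made up for illustration, and I have not run this against a 4 GB file.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text content of every <title> element.
final class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // The default factory is not namespace-aware, so compare against qName.
        if ("title".equals(qName)) {
            inTitle = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append rather than assign.
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(text.toString());
            inTitle = false;
        }
    }

    // Streams the input through whichever SAX implementation the factory selects.
    static List<String> parseTitles(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleCollector handler = new TitleCollector();
        parser.parse(in, handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TitleCollector <file.xml>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            for (String title : parseTitles(in)) {
                System.out.println(title);
            }
        }
    }
}
```

Only the titles are kept in memory, not the document. With xercesImpl.jar and xml-apis.jar on the class path the same code should run on org.apache.xerces.jaxp.SAXParserImpl without any change.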

An alternative StAX parser is an option as well.
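
If you would rather keep your XMLStreamReader code, you can check which StAX implementation you are getting in the same way as for SAX. As far as I know, a third-party implementation such as Woodstox is picked up by XMLInputFactory.newInstance() once its jar is on the class path, but treat that as an assumption; I have not tried it with your file. The class name StaxImplCheck and the countTitles helper below are only for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reports the StAX implementation in use and runs a tiny parse with it.
final class StaxImplCheck {

    // Name of the XMLInputFactory implementation selected at runtime.
    static String factoryClassName() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    // Counts <title> start tags, using the same pull-parsing loop as a real run would.
    static int countTitles(String xml) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("StAX factory class: " + factoryClassName());
        System.out.println("Titles found: "
                + countTitles("<pages><title>a</title><title>b</title></pages>"));
    }
}
```

If the printed factory class still starts with com.sun, the replacement jar is not being found on the class path.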

pd40
  • I've just implemented a SAX parser but the result is exactly the same. Here is my source: http://pastebin.com/K500kyqp – Paolo Dragone Apr 06 '14 at 18:11
  • Updated answer. I tested with OpenJDK 1.7 & SAXParserFactory.newInstance() defaults to com.sun but will switch to xerces if it is in the class path. I did not test it with a 4GB xml file so I hope it helps! – pd40 Apr 06 '14 at 20:48
  • Sorry for my ignorance, but is it enough to pass the paths to the jar files to `javac`? I mean like this: `javac -cp /path/to/xerces.jar:/path/to/xml-apis.jar Main.java`. I tried that, but when I execute `java -cp /same/paths/as/before Main` I get this error: `Error: Could not find or load main class Main` – Paolo Dragone Apr 07 '14 at 16:32
  • Never mind, I also added `.` to the classpath option and now it works. Anyhow, I still get the same parser class (`com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl`) even though I followed your advice. – Paolo Dragone Apr 07 '14 at 16:50
  • Sorry, I'm wrong again... I had downloaded the wrong xml-apis jar; now I get the right parser class. I will try my program in a little while. – Paolo Dragone Apr 07 '14 at 17:11