JAXP parse error on valid XML

Question

I am trying to run some XPath Queries on XML in Java and apparently the recommended way to do so is to construct a document first.

Here is the standard JAXP code sample that I was using:

import org.w3c.dom.Document;
import javax.xml.parsers.*;

final DocumentBuilder xmlParser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
final Document doc = xmlParser.parse(xmlFile);

I also tried the Saxon API, but got the same errors:

import net.sf.saxon.s9api.*;

final DocumentBuilder documentBuilder = new Processor(false).newDocumentBuilder();
final XdmNode xdm = documentBuilder.build(new File("out/data/blog.xml"));

Here is a minimal reconstructed example XML which the DocumentBuilder in JDK 1.8 can't parse:

<?xml version="1.1" encoding="UTF-8" ?>
<xml>
    <![CDATA[Some example text with [funny highlight]]]>
</xml>

According to the spec, the square bracket ] just before the end-of-CDATA marker ]]> is perfectly legal, but the parser just exits with a stack trace and the message org.xml.sax.SAXParseException; XML document structures must start and end within the same entity.

On my original data file, which contains a lot of CDATA sections, the message is instead org.xml.sax.SAXParseException; The element type "item" must be terminated by the matching end-tag "</item>". In both cases `com.sun.org.apache.xerces` shows up in the stack trace a lot.

From both observations it seems as if the parser simply doesn't end the CDATA section at ]]>.

EDIT: As it turned out, the example will pass when the <?xml ... ?> declaration is omitted. I hadn't checked that before posting here and added it just now.
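
For anyone who wants to try this without a file on disk, here is a minimal sketch that feeds the snippet to the default JAXP parser twice, once with the version 1.1 declaration and once without it. The class name CdataRepro and the helper tryParse are made up for illustration. Whether the first variant actually fails depends on the JDK build and on which parser is on the classpath, and parsing from a string may not behave exactly like parsing from a file, so no expected output is given.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original post.
class CdataRepro {
    // The document body from the question, without the XML declaration.
    static final String BODY =
        "<xml>\n    <![CDATA[Some example text with [funny highlight]]]>\n</xml>\n";

    /** Returns "ok" if the text parses, otherwise the exception type and message. */
    static String tryParse(String xml) {
        try {
            // Uses whichever parser factory JAXP finds first (JDK built-in or classpath).
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            parser.parse(new InputSource(new StringReader(xml)));
            return "ok";
        } catch (Exception e) {
            return "failed: " + e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String withDeclaration = "<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n" + BODY;
        System.out.println("with 1.1 declaration: " + tryParse(withDeclaration));
        System.out.println("without declaration:  " + tryParse(BODY));
    }
}
```

Running it once with only the JDK on the classpath and once with another parser added makes it easy to see whether the failure comes from the parser implementation or from the data.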

What version of Java are you using, exactly? That XML parses fine in Java 1.8.0_161. — VGR, Feb 17 '18 at 18:03
Oh dear! I have a 64 MB XML file here where the JDK parser (Java 8 as well as Java 9!) fails at just this `]]]>` bit, but I have failed (for now) to find a smaller example snippet to post here which also fails. Maybe I need to include namespaces and such. But now I have a stiff neck from looking at the screen for too long and need to go play JustDance to save my body from turning to stone. — Robert Jack Will, Feb 17 '18 at 21:55

score 2 · Answer 1 · answered Feb 17 '18 at 16:35

Short answer: add Apache Xerces to the build path. It will automatically be loaded instead of the parser from the JDK, and the XML will be parsed just fine! Copy-paste Gradle dependency:

implementation "xerces:xercesImpl:2.11.0"

Some background: Apache Xerces is indeed the same parser which is also used in the JDK, but even though Xerces 2.11 dates from 2013 the JDK comes with a much older version. That really sucks!

As the Saxon team puts it:

Saxonica recommends use of the Xerces parser from Apache in preference to the version bundled in the JDK, which is known to have some serious bugs.

In case you wonder how simply putting Xerces on the classpath makes the problem disappear: even though the JDK and Saxon DocumentBuilders construct entirely different document types, they both use the same standard Java interfaces (JAXP) to call the parser, and also the same mechanism to find and load the parser (or rather, the parser factory). In short, unless a system property names a factory explicitly, a java.util.ServiceLoader is called and looks into all the JARs on the classpath for provider-configuration files in META-INF/services, and this is how the Xerces JAR announces that it provides an XML parser. Fortunately for us, the JDK's own implementation is only the fallback and is superseded by anything found there.

After this bad experience with the JDK XML classes, I am even more motivated to refactor projects to use Saxon for XPath processing instead of the XPath implementation in the JDK. The other reason is the technical advantage of XDM over DOM (same link as above).
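
To check which implementation actually gets picked up, a small sketch like the following can help. The class name WhichParser and the method selectedFactory are made up for illustration. As far as I know, the JDK's built-in factory lives in the com.sun.org.apache.xerces.internal package and the Apache one in org.apache.xerces, but treat those names as an assumption and just compare the output with and without the Xerces JAR.

```java
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper class, not part of the original answer.
class WhichParser {
    /** Class name of the factory that JAXP selects for the current classpath. */
    static String selectedFactory() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("selected: " + selectedFactory());

        // Lists factories announced via
        // META-INF/services/javax.xml.parsers.DocumentBuilderFactory on the classpath.
        // The list may be empty when only the JDK's built-in parser is available.
        for (DocumentBuilderFactory factory : ServiceLoader.load(DocumentBuilderFactory.class)) {
            System.out.println("provider: " + factory.getClass().getName());
        }
    }
}
```

If the selected class does not change after adding the dependency, the JAR is probably not on the runtime classpath, or a system property is forcing a particular factory.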

You seem to be implying that Java in general has this problem, but [Java 9 uses Xerces 2.11.0.](http://openjdk.java.net/jeps/255) The question does mention it’s using Java 8, so your point is valid, but it’s worth noting that using Java 9 is a valid solution, one which doesn’t require bundling an explicit third party library. — VGR, Feb 17 '18 at 17:52
Thanks for the link! I might do some tests with Java 9 when I have time. In general, the description behind the link implies that the JDK and Xerces have diverged so much that (a) both might have different bugs and (b) merging patches is non-trivial. If Xerces continues to stagnate this much, then this will not be a problem, of course. — Robert Jack Will, Feb 17 '18 at 21:15
Test done with the corrected XML snippet in the question. Java 9 fails it, too! Tested on Oracle Java build 1.8.0_152-b16 and on OpenJDK 9-Ubuntu+0-9b161-1. — Robert Jack Will, Feb 18 '18 at 16:46
It’s the prolog. Change it from `version="1.1"` to `version="1.0"` (or omit the prolog entirely) and everything works, both in Java 8 and Java 9. Investigating why… — VGR, Feb 18 '18 at 20:41
