I am trying to run some XPath queries on XML in Java, and apparently the recommended way to do so is to construct a document first.
Here is the standard JAXP code sample that I was using:
import org.w3c.dom.Document;
import javax.xml.parsers.*;
final DocumentBuilder xmlParser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
final Document doc = xmlParser.parse(xmlFile);
I also tried the Saxon API, but got the same errors:
import net.sf.saxon.s9api.*;
final DocumentBuilder documentBuilder = new Processor(false).newDocumentBuilder();
final XdmNode xdm = documentBuilder.build(new File("out/data/blog.xml"));
Here is a minimal reconstructed example XML which the DocumentBuilder in JDK 1.8 can't parse:
<?xml version="1.1" encoding="UTF-8" ?>
<xml>
<![CDATA[Some example text with [funny highlight]]]>
</xml>
According to the spec, the square bracket ] just before the end-of-CDATA marker ]]> is perfectly legal, but the parser just exits with a stack trace and the message org.xml.sax.SAXParseException; XML document structures must start and end within the same entity.
On my original data file, which contains a lot of CDATA sections, the message is instead org.xml.sax.SAXParseException; The element type "item" must be terminated by the matching end-tag "</item>". In both cases com.sun.org.apache.xerces shows up in the stack trace a lot.
From both observations, it seems as if the parser simply doesn't end the CDATA section at ]]>.
EDIT: As it turned out, the example will pass when the <?xml ... ?> declaration is omitted. I hadn't checked that before posting here and added it just now.
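To make the comparison easier to reproduce, here is a minimal self-contained sketch that feeds the same element to the JAXP parser twice, once without a declaration and once with the version 1.1 declaration from the example. The class name CdataRepro and the helper tryParse are made up for illustration, and the XML is parsed from a string rather than from my file. I would expect the second call to fail on JDK 1.8 as described above, but the outcome may differ on other JDK versions or parser implementations.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical reproduction class; not part of my original code.
public class CdataRepro {

    // Same element as in the example, with "]" directly before "]]>".
    static final String BODY =
            "<xml>\n<![CDATA[Some example text with [funny highlight]]]>\n</xml>";

    // Declaration taken from the example; it is the only difference between the two runs.
    static final String DECLARATION_1_1 =
            "<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n";

    /** Returns the root element's text content, or null if parsing fails. */
    static String tryParse(String xml) {
        try {
            DocumentBuilder parser =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Report the failure instead of aborting, so both variants get tried.
            System.out.println("parse failed: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("without declaration: " + tryParse(BODY));
        System.out.println("with 1.1 declaration: " + tryParse(DECLARATION_1_1 + BODY));
    }
}
```

If only the second call reports a failure, that would point at the XML 1.1 code path of the parser rather than at the CDATA content itself.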