I'm encountering com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException
with an XML file. I stepped through the Xerces code with a debugger and narrowed down the area where this was occurring. I was able to determine that removing the "smart quote" characters from the document makes it parseable.
The document came with no XML declaration. Notepad++ pegs it as "ANSI as UTF-8". Firefox pegs it as "Western". I recall from a not-so-breathtaking lecture in college that UTF-8 was designed to be backward-compatible with single-byte encoding systems. I also see that on this chart, the byte sequence e2 80 9d
is, in fact, representative of a "RIGHT DOUBLE QUOTATION MARK", but even though I can't see an encoding problem, I'm thinking there is one.
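To double-check the chart, here is a minimal sketch that decodes the three bytes e2 80 9d as UTF-8 and encodes the result back. The class and method names (SmartQuoteBytes, decodeUtf8, hex) are made up for illustration and are not part of Xerces or my project. As far as I know, UTF-8's backward compatibility only covers 7-bit ASCII, so a smart quote always takes more than one byte in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class SmartQuoteBytes {
    // Decode raw bytes as UTF-8 (hypothetical helper for this sketch).
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated lowercase hex, e.g. "e2 80 9d".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sequence = {(byte) 0xE2, (byte) 0x80, (byte) 0x9D};
        String decoded = decodeUtf8(sequence);

        // The three bytes decode to one character, U+201D.
        System.out.println(decoded.length() == 1 && decoded.charAt(0) == 0x201D);

        // Encoding the character again yields the same three bytes.
        System.out.println(hex(decoded.getBytes(StandardCharsets.UTF_8)));

        // A plain ASCII character stays a single byte in UTF-8.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));
    }
}
```

If the bytes in the file really are e2 80 9d, this suggests the sequence itself is fine and the problem lies in how the bytes reach the parser.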
The exception message I'm getting from Xerces is Invalid byte 3 of 3-byte UTF-8 sequence.
It's getting thrown from the invalidByte(3, 3, b2)
call on line 435 of UTF8Reader. When I try to fully understand the logic of this method, my brain begins to melt out of my ears a little, so I could be missing something, but as I mentioned above, byte 3 (0x9d), at least of the sequence above, is valid according to the UTF-8 table.
Here is the segment of the file where the double quote occurs shown in a hex editor:
I have tried the following:
- Forcing the String to be loaded using UTF-8 via Charset.forName("UTF-8")
- Adding the XML declaration
<?xml version="1.0" encoding="UTF-8"?>
- Opening the file in Notepad++ and encoding it as UTF-8 through its UI
- Various combinations of the above, sometimes repeatedly
The byte indicated as invalid seems to be 63 (0x3F?)
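To find out which byte a strict UTF-8 decoder actually rejects, independent of Xerces, something like the following sketch might help. It uses the JDK's CharsetDecoder with CodingErrorAction.REPORT; the class and method names (Utf8Probe, firstMalformedOffset) are hypothetical. Note that 0x3F is an ASCII '?', which is not a valid continuation byte, so a sequence such as e2 80 3f would be rejected even though e2 80 9d is fine.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Probe {
    /**
     * Returns the offset at which the first malformed UTF-8 sequence starts,
     * or -1 if the whole array decodes cleanly.
     */
    public static int firstMalformedOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never decodes to more chars than there are input bytes.
        CharBuffer out = CharBuffer.allocate(data.length + 1);

        CoderResult result = decoder.decode(in, out, true);
        if (result.isError()) {
            // On error, the input position is left at the start of the bad sequence.
            return in.position();
        }
        result = decoder.flush(out);
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        byte[] good = {'a', 'b', (byte) 0xE2, (byte) 0x80, (byte) 0x9D};
        byte[] bad = {'a', 'b', (byte) 0xE2, (byte) 0x80, 0x3F};
        System.out.println(firstMalformedOffset(good));
        System.out.println(firstMalformedOffset(bad));
    }
}
```

Running this over the raw bytes of the file (for example, those returned by FileUtils.readFileToByteArray) should show whether the bytes on disk are malformed or whether they only become malformed somewhere between the file and the parser.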
I've also tried adding this smart quote character to a document that was previously parseable. As expected, it makes the parser throw the same exception.
Stack trace:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
...
Update: I still need to find a way to safely convert this to a String. I've encoded the file as UTF-8 using Notepad++. The code below successfully loads the bytes into a String (I can read the XML in the String when debugging in Eclipse), but now I'm getting MalformedByteSequenceException with different parameters. This time, I can post both the code and XML I'm using:
File file = new File("ccd.xml");
byte[] ccdBytes = org.apache.commons.io.FileUtils.readFileToByteArray(file);
String ccdString = new String(ccdBytes, Charset.forName("UTF-8"));
CDAUtil.load(new ByteArrayInputStream(IOUtils.toByteArray(ccdString))); //method that's doing the parsing
Stack Trace:
Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
at org.openhealthtools.mdht.emf.runtime.resource.impl.FleXMLLoadImpl.load(FleXMLLoadImpl.java:55)
at org.eclipse.emf.ecore.xmi.impl.XMLResourceImpl.doLoad(XMLResourceImpl.java:180)
at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1494)
at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:268)
at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:250)
at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:238)
However,
CDAUtil.load(new FileInputStream(new File("ccd.xml")));
works.
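One difference between the two calls that I suspect matters, though I have not confirmed it: the failing version turns the String back into bytes, and as far as I understand, IOUtils.toByteArray(String) in Commons IO encodes with the platform default charset rather than UTF-8. The sketch below imitates that round trip with explicit charsets so the result does not depend on my machine. The class and method names (RoundTrip, reencode) are hypothetical, and windows-1252 is only my guess at a default charset on a Western Windows setup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    // Decode UTF-8 bytes into a String, then encode that String with the target charset.
    public static byte[] reencode(byte[] utf8Bytes, Charset target) {
        String text = new String(utf8Bytes, StandardCharsets.UTF_8);
        return text.getBytes(target);
    }

    public static void main(String[] args) {
        // A tiny stand-in document containing one right double quotation mark (U+201D).
        byte[] original = "<p>\u201D</p>".getBytes(StandardCharsets.UTF_8);

        byte[] viaUtf8 = reencode(original, StandardCharsets.UTF_8);
        // Assumption: windows-1252 stands in for a non-UTF-8 platform default.
        byte[] via1252 = reencode(original, Charset.forName("windows-1252"));

        // Re-encoding as UTF-8 reproduces the original bytes exactly.
        System.out.println(Arrays.equals(original, viaUtf8));

        // Re-encoding as windows-1252 collapses the quote to the single byte 0x94,
        // which is not a valid start of a UTF-8 sequence.
        System.out.println(String.format("%02x", via1252[3] & 0xFF));
    }
}
```

If that is what is happening, a lone byte like 0x94 reaching a UTF-8 reader would be consistent with the "Invalid byte 1 of 1-byte UTF-8 sequence" message, and passing ccdString.getBytes(StandardCharsets.UTF_8) to the ByteArrayInputStream would be the thing to try.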