Encoding Issue Causing MalformedByteSequenceException in Xerces UTF8Reader

Question

I'm encountering com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException with an XML file. I stepped through the Xerces code with a debugger and narrowed down the area where this was ocurring. I was able to determine that by removing the "smart quote" characters in the document, the document becomes parseable.

The document came with no DTD. Notepad++ pegs it as "ANSI as UTF-8". Firefox pegs it as "Western". I recall from a not-so-breathtaking lecture in college that UTF-8 was designed to be backward-compatible with single-byte encoding systems. I also see that on this chart, the byte sequence e2 80 9d is, in fact, representative of a "RIGHT DOUBLE QUOTATION MARK", but even though I can't see an encoding problem, I'm thinking there is one.

The exception message I'm getting from Xerces is Invalid byte 3 of 3-byte UTF-8 sequence. It's getting thrown from the invalidByte(3, 3, b2) call on line 435 of UTF8Reader. When I try to fully understand the logic of this method, my brain begins to melt out of my ears a little so I could be missing something, but as I mentioned above byte 3 (0x90). at least of the sequence above, is valid according to the UTF-8 table.

Here is the segment of the file where the double quote occurs shown in a hex editor: XML segment

I have tried the following:

Forcing the String to be loaded using UTF-8 via Charset.forName("UTF-8")
Adding the DTD <?xml version="1.0" encoding="UTF-8"?>
Opening the file in Notepad++ and encoding it as UTF-8 through its UI
Various combinations of the above, sometimes repeatedly

The byte indicated as invalid seems to be 63 (0x3F?) eclipse screenshot

I've also tried adding this smart quote character to a document that was previously parseable. As expected, it makes the parser throw up the same exception.

Stack trace:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)

...

Update: I still need to find a way to safely convert this to a String. I've encoded the file as UTF-8 using Notepad++. The code below successfully loads the bytes into a String (I can see read the XML in the String when when debugging in Eclipse), but now I'm getting MalformedByteSequenceException with different parameters. This time, I can post both the code and XML I'm using:

File file = new File("ccd.xml");

byte[] ccdBytes = org.apache.commons.io.FileUtils.readFileToByteArray(file);
String ccdString = new String(ccdBytes, Charset.forName("UTF-8"));

CDAUtil.load(new ByteArrayInputStream(IOUtils.toByteArray(ccdString))); //method that's doing the parsing

Stack Trace:

Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:557)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2815)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at org.openhealthtools.mdht.emf.runtime.resource.impl.FleXMLLoadImpl.load(FleXMLLoadImpl.java:55)
    at org.eclipse.emf.ecore.xmi.impl.XMLResourceImpl.doLoad(XMLResourceImpl.java:180)
    at org.eclipse.emf.ecore.resource.impl.ResourceImpl.load(ResourceImpl.java:1494)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:268)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:250)
    at org.openhealthtools.mdht.uml.cda.util.CDAUtil.load(CDAUtil.java:238)

However,

CDAUtil.load(new FileInputStream(new File("ccd.xml")));

works

If you are using Notepad++ then try "Convert to UTF-8", which makes the coding explicit in the file itself. NP++ uses "ANSI as UTF-8" to show that is is (probably) UTF-8 but without a BOM. Looking at the hex, `9d` = `10011101` is a valid third byte in a UTF-8 combination starting with `e` = `1110` — rossum, Jun 08 '15 at 12:14
The 465 of the class you pointed is ` if (b1 == -1) {` which does not throw MalformedByteSequenceException. Morover I don't see throwing this exception from any code in this class. Are you sure you pointed right version of lib source code ? — walkeros, Jun 08 '15 at 12:17
@necreaux, not really but I'll add the stack trace up to the point where libs are called — kwikness, Jun 08 '15 at 13:00
@rossum, I mentioned that I tried that "Opening the file in Notepad++ and encoding it as UTF-8 through its UI" — kwikness, Jun 08 '15 at 13:01
@walkeros, I changed the link to point to a pastebin with the actual code being used in my instance and the corrected line number — kwikness, Jun 08 '15 at 13:05
About NP++, there are two different menu options under 'Encoding': "Encode in UTF-8" and "Convert to UTF-8" Try both. — rossum, Jun 08 '15 at 14:12

score 6 · Accepted Answer · edited May 23 '17 at 12:09

6

You did not told us how you pass the file to Xerces. You can do in a different ways with different results. You can read a more detailed explanation about xml encoding issues here

I suggest you to do the following one:

Open the file with notedpad++ and add <?xml version="1.0" encoding="UTF-8"?> as first line if is missing
In Notepad++ do a convertion to UTF-8 (without bom) (it should be in the format menu but I'm using the italian version of notepad++ so I'm guessing about menu translation)
Save the file
In Java open it as an InputStream,i.e. pass to xml parser an InputStream and NOT as a Reader subclass

This should solve the problem, if you can past your code that pass the file to parser it's easier to find the problem.

The steps solve the problem because the first line in xml with the encoding declaration is considered by the parser only if you are using an InputStream i.e a stream of bytes. If you read a stream of bytes you need an encoding declaration in order to specify how to convert bytes into chars.

If you are passing the a String the first line is useless because you are passing a stream of chars and there is no need of encoding.

If you want to use a String you must read the file as a InputStream and convert to a Reader specifying a charset (some thing like InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");)

My guess is that you were getting the error because you did not specify the charset and Java picked up the one of your os (Windows-1252).

edited May 23 '17 at 12:09

Community

1
1

answered Jun 08 '15 at 13:22

Giovanni

3,951
2
24
30

Did you read my entire post? I mentioned that I tried 3 of the 4 items you mentioned. – kwikness Jun 08 '15 at 13:24
1

yes I read the post but if you did only the first three steps and not the fourth one you can still get the exception especially if the file is in UTF-8 and you are working on windows. – Giovanni Jun 08 '15 at 13:30
wow. that actually worked. please provide some details on why and i'll give you best answer. thanks. – kwikness Jun 08 '15 at 13:33
also details on how I can load this document into a String that doesn't break the parser would be helpful – kwikness Jun 08 '15 at 13:35
@kwikness Done; if you downvote my question please change to an upvote thanks. – Giovanni Jun 08 '15 at 13:45
thanks for your help so far. still having trouble safely getting this into a string though. added details to my question – kwikness Jun 08 '15 at 14:42
1

You are rading the file as bytes[] (ccdBytes) then converting to String (ccdString) and then to byte[] again (IOUtils.toByteArray(ccdString)) you can user ccdBytes as argument for ByteArrayInputStream. anyway if you want to use IOUtils.toByteArray you must use IOUtils.toByteArray(ccdString,"UTF-8") since your xml file has an encoding of UTF-8 in the first line – Giovanni Jun 08 '15 at 14:52
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/79967/discussion-between-kwikness-and-giovanni). – kwikness Jun 08 '15 at 14:55

score 2 · Answer 2 · answered Jun 08 '15 at 13:27

I'm only able to get that error message when I have an actual UTF-8 coding error in my input file. Therefore, I assume that there is an actual error in the file somewhere that you can't find.

Here's my test code:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseAXml {
  public static void main(String argv[]) throws Exception {
    String xmlFile = argv[0];
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(xmlFile);
    System.out.println("Parsed Successfully");
  }
}

When I pass it a correct file - containing smart quotes as you have it - I get the expected Parsed Successfully message. Here's my regular test file:

$ hexdump -C tmp.xml
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  74 68 69 6e 67 3e 3c 61  |F-8"?>.<thing><a|
00000030  3e 54 68 69 73 20 e2 80  9c 71 75 6f 74 65 e2 80  |>This ...quote..|
00000040  9d 20 63 6f 75 6c 64 20  67 65 74 20 74 72 69 63  |. could get tric|
00000050  6b 79 3c 2f 61 3e 3c 2f  74 68 69 6e 67 3e 0a     |ky</a></thing>.|
0000005f

When I test a file with an error in it - made by mangling the byte at offset 0x38 - I get the exception you see:

$ java ParseAXml tmp.err.xml
Exception in thread "main" com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 3 of 3-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:687)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:435)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1753)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(XMLEntityScanner.java:1426)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2807)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:347)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
    at ParseAXml.main(ParseAXml.java:10)

So to help you out, I wrote a short java program that attempts to find the malformed byte:

import java.nio.*;
import java.nio.charset.*;
import java.io.*;

public class FindBadUTF8 {
  public static void main(String argv[]) throws Exception {
    String filename = argv[0];
    InputStream inStream = new FileInputStream(filename);

    CharsetDecoder d=Charset.forName("UTF-8").newDecoder();
    CharBuffer out = CharBuffer.allocate(1);
    ByteBuffer in = ByteBuffer.allocate(10);
    in.clear();
    long offset = 0L;
    while (true) {
      int read = inStream.read();
      if (read != -1) {
        in.put((byte)read);
      }
      out.clear();
      in.flip();
      CoderResult cr = d.decode(in, out, (read == -1));
      if (cr.isError()) {
        if (read != -1) {
          System.out.println("Error at offset " + offset + ": " + cr);
          return;
        } else {
          System.out.println("Error at end-of-file: " + cr);
          return;
        }
      }
      if (cr.isUnderflow()) {
        in.position(in.limit());
        in.limit(in.capacity());
      } else {
        in.clear();
      }
      if (read == -1) {
        break;
      }
      offset += 1L;
    }
    System.out.println("OK");
  }
}

That program, when run against my sample file with an error in it gives me this:

$ java FindBadUTF8 tmp.err.xml
Error at offset 56: MALFORMED[2]

And indeed, offset 56 (which is 0x38 in hex) is the byte I mangled:

$ hexdump -C tmp.err.xml
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  74 68 69 6e 67 3e 3c 61  |F-8"?>.<thing><a|
00000030  3e 54 68 69 73 20 e2 80  ff 71 75 6f 74 65 e2 80  |>This ...quote..|
00000040  9d 20 63 6f 75 6c 64 20  67 65 74 20 74 72 69 63  |. could get tric|
00000050  6b 79 3c 2f 61 3e 3c 2f  74 68 69 6e 67 3e 0a     |ky</a></thing>.|
0000005f

Thanks that was extremely helpful. I can now be sure the file is fine. I ran my sample through this and it looks like the file is in fact not corrupted. Here is a file that was working but became unparseable when I added a smartquote: http://pastebin.com/cerU79jz — kwikness, Jun 08 '15 at 13:49

Encoding Issue Causing MalformedByteSequenceException in Xerces UTF8Reader

2 Answers2