Invalid byte 2 of 4-byte UTF-8 sequence, but only when executing JAR?

Question

I have a Java program that takes an XML string from a SQL Server database, transforms it with a Transformer obtained from TransformerFactory, writes the result to a file, and then uses that file to generate a PDF.

It works fine when I run it from NetBeans, but when I run the JAR from the project's dist folder I get an "Invalid byte 2 of 4-byte UTF-8 sequence" error.

After changing the encoding of the XML string to UTF-8, it works from the JAR too.

So my question is, why would it work when running the project in NetBeans but not from the JAR file before changing the encoding?

I have only tried this on Windows.

Code:

Here is the code that reads the XML from the SQL Server result set (original):

SQLXML xml = null;
String xmlString = "";
while (rs.next()){
    xml = rs.getSQLXML(1);
    xmlString = xml.getString();
}
return xmlString;

...and modified:

SQLXML xml = null;
String xmlString = "";
while (rs.next()){
    xml = rs.getSQLXML(1);
    // Note explicit UTF-8 encoding specified
    xmlString = new String(xml.getString().getBytes(),"UTF8");
}
return xmlString;
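
A note on what the modified line does: `getBytes()` with no argument encodes using the platform default charset, and the result is then decoded as UTF-8. The sketch below reproduces that round trip with the "default" passed in explicitly, so both cases can be compared on one machine. The class and method names are hypothetical, and ISO-8859-1 is only a stand-in for a single-byte Windows default such as Cp1252.

```java
import java.nio.charset.Charset;

public class RoundTripDemo {
    // Mimics new String(s.getBytes(), "UTF8"), with the platform default
    // charset supplied as a parameter instead of taken from the JVM.
    static String roundTrip(String s, Charset assumedDefault) {
        byte[] bytes = s.getBytes(assumedDefault);
        return new String(bytes, Charset.forName("UTF-8"));
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"

        // Default charset is UTF-8: the round trip changes nothing.
        System.out.println(original.equals(
                roundTrip(original, Charset.forName("UTF-8")))); // true

        // Single-byte default: the lone byte 0xE9 is not valid UTF-8,
        // so the decoder substitutes U+FFFD and the text is altered.
        System.out.println(original.equals(
                roundTrip(original, Charset.forName("ISO-8859-1")))); // false
    }
}
```

So the same line behaves differently depending on the default charset of the JVM it runs in, and outside a UTF-8 default it can silently replace non-ASCII characters.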

And here the transformation:

public static void serialize(Document doc, OutputStream out) throws Exception {
    TransformerFactory tfactory = TransformerFactory.newInstance();
    try {
        Transformer serializer = tfactory.newTransformer();
        serializer.setOutputProperty("indent", "yes");
        serializer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        serializer.transform(new DOMSource(doc), new StreamResult(out));
    } catch (TransformerException e) {
        e.printStackTrace();
        throw new RuntimeException(e);
    }
}
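
For comparison, here is a minimal sketch that never relies on the platform default charset. It assumes the problem arises where text is converted to bytes between the database and the parser (the file-writing code is not shown above). The XML string is parsed through a `StringReader`, so no byte decoding happens on input, and the output encoding is set explicitly. The class name, method name, and sample XML are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ExplicitEncodingDemo {
    // Parses XML held in a String and serializes it as UTF-8 bytes.
    static byte[] toUtf8Bytes(String xml) {
        try {
            // Character-based input: the parser never has to guess an encoding.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            // State the output encoding instead of relying on defaults.
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toUtf8Bytes("<name>caf\u00e9</name>");
        // Decode with the same charset that was used to encode.
        System.out.println(new String(bytes, Charset.forName("UTF-8")));
    }
}
```

The same idea applies when writing to a file: pass a `FileOutputStream` to `StreamResult`, or wrap it in an `OutputStreamWriter` with an explicit charset name, rather than using a `FileWriter`, which uses the platform default.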

Are you, by chance, using one of the conversion mechanisms that uses a "platform specific default encoding"? (e.g. it is assuming UTF-8 for something UTF-16.) And, is the JVM used the same in both instances? Posting the relevant code would be ... useful. — , Nov 10 '11 at 02:07
Thanks for answering! Added code now. I'm not sure about that, what would one of those mechanisms be? I think SQLServer returns the xml string unicode encoded, so that's why I get that error, but why would it work through netbeans without changing the encoding? — Daniel Montes de Oca, Nov 10 '11 at 02:22
I bet a different JVM is being used in each case: can that proposition be confirmed or rejected? — , Nov 10 '11 at 02:24
That could be it! Thanks a lot, let me check this out and I'll get back to you. What would be the easiest way to confirm this? — Daniel Montes de Oca, Nov 10 '11 at 02:28
I don't use netbeans :) But `java -version` should spit out useful things. — , Nov 10 '11 at 02:40
Ok :P I know I'm running 1.6.0.26 from command, but I'm still trying to find what netbeans is using. I wasn't even aware it could be using a different version. — Daniel Montes de Oca, Nov 10 '11 at 02:42
I tested and it still doesn't work even when using the same version :( I even reinstalled the JDK and JRE and changed the JDK path in NetBeans to make sure it was the same. — Daniel Montes de Oca, Nov 10 '11 at 03:38
The plot thickens. I am out of ideas, but [How to Find Default Charset/Encoding in Java?](http://stackoverflow.com/questions/1749064) indicates it is in a System property, which might be configured differently in one environment. This [article](http://wiki.clinicaltools.com/NetBeans:UTF-8_Character_Set) explains how to change the default encoding in the NetBeans config (I am not sure if this is just for the IDE or also for projects run from it :-) — , Nov 10 '11 at 05:45
If you [ever] get the issue figured out, don't forget to post a self-answer (and accept it) :) — , Nov 11 '11 at 00:55

score 2 · Accepted Answer · answered Jan 05 '12 at 22:29

I tried a simple application in NetBeans that prints Charset.defaultCharset(), and it returns "UTF-8". The same application in Eclipse returns "MacRoman". I'm on a Mac; on Windows it would typically return "windows-1252" (also known as Cp1252).

So yes: an application run from NetBeans defaults to UTF-8, which is why you had no issues parsing the XML there. Run directly from the JAR, the JVM uses the platform default encoding instead.
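
To check this on your own machine, a small sketch like the following prints the default charset the JVM picked up. The class name is hypothetical. Run it once from the IDE and once with `java -jar`, then compare the output.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Reports the charset used by calls that take no explicit encoding,
    // such as String.getBytes() and FileWriter.
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the two runs differ, one commonly used workaround is to start the JAR with `java -Dfile.encoding=UTF-8 -jar yourapp.jar` (the JAR name is a placeholder). Specifying the charset explicitly wherever bytes and text are converted is more robust, because the code then no longer depends on how it is launched.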
