3

My problem is as follows. I am reading in an XML file whose text nodes partially contain UTF-8-encoded opening and closing double quotation marks ('“' and '”'). The text is extracted, truncated to 3999 bytes, put into a new XML format, and then saved as a file.

While both characters are displayed correctly by Notepad++ in the input file, the output file contains invalid UTF-8 sequences that not even Notepad++ is able to display.

The opening double quotes are printed correctly, but the closing ones are mangled.

Using a hex editor, I found out that the bytes are somehow changed from

E2 80 9D

in the input file to

E2 80 3F

in the output file. I am using the SAX parser for the XML parsing.

Are there any known bugs that could cause such a behaviour?

vog
LuigiEdlCarno
    Can you post a *minimal* (but complete) program which displays this behaviour, ideally with a sample XML file (possibly hosted elsewhere, so we can get the exact binary data)? – Jon Skeet Jan 17 '13 at 12:34
  • Well it seems that the file is being decoded as Windows-1252. That would explain the result, since decoding `e2 80 9d` as Windows-1252 and re-encoding it comes out as `e2 80 3f`, where `3f` is `?`, the replacement character, because `9d` is unassigned in Windows-1252. – Esailija Jan 17 '13 at 12:37

3 Answers

1

E2 80 9D is a valid UTF-8 byte sequence, encoding '”' = '\u201d'. You can see this because every byte has its high bit set. This is a deliberate safety property of UTF-8: an ASCII character such as '/' can never erroneously appear inside a multi-byte sequence.

In the second sequence, 3F ('?') has no high bit set, so the sequence is not valid UTF-8. That means the data was mangled while being read or written, most likely by decoding and re-encoding with the wrong charset and replacing an unmappable character with a question mark. In particular, 9D lies in the 80-9F range of the extended Windows Latin-1, a.k.a. Cp1252, where some positions are unassigned.
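To see this concretely, here is a minimal sketch (the array names are just illustrative) that checks the high bit of each byte:

byte[] original = { (byte) 0xE2, (byte) 0x80, (byte) 0x9D }; // valid UTF-8 for '”'
byte[] mangled  = { (byte) 0xE2, (byte) 0x80, (byte) 0x3F }; // bytes from the output file

for (byte[] seq : new byte[][] { original, mangled }) {
    for (byte b : seq) {
        // Every byte of a multi-byte UTF-8 sequence has its high bit set
        System.out.printf("%02X high bit set: %b%n", b, (b & 0x80) != 0);
    }
    System.out.println();
}

3F prints false: it can never be a continuation byte, so the mangled sequence cannot have come out of a UTF-8 encoder.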

Joop Eggen
1

Not a known bug but a common mistake: leaving the encoding out when reading or writing files, which results in the platform default encoding being used, Windows-1252 in this case.

When you initially read the file, you should specify UTF-8 for decoding, and when writing to a new file, you should specify UTF-8 for encoding. If you post your implementation, I can correct it in place.

How this can be reproduced:

// The UTF-8 bytes of the closing double quote '”' (U+201D)
byte[] quoteUtf8 = { (byte) 0xE2, (byte) 0x80, (byte) 0x9D };

// Decode with the wrong charset: 9D is unassigned in Windows-1252,
// so it becomes the replacement character
String decodedPlatformDefault = new String(quoteUtf8, "Windows-1252");

// Re-encode: the replacement character is written as '?' (3F)
byte[] encodedPlatformDefault = decodedPlatformDefault.getBytes("Windows-1252");

for (byte b : encodedPlatformDefault) {
    System.out.print(String.format("%02x ", b)); // prints: e2 80 3f
}
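And a sketch of the fix (the file names here are hypothetical): pass the charset explicitly on both sides instead of relying on the platform default.

Reader in = new InputStreamReader(new FileInputStream("input.xml"), "UTF-8");
Writer out = new OutputStreamWriter(new FileOutputStream("output.xml"), "UTF-8");

Note that if you give the SAX parser the raw InputStream, it detects the encoding from the XML declaration by itself; the mistake usually creeps in when the stream is wrapped in a Reader, or when the result is written, using the default charset.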
Esailija
0

You should always specify the character set when creating new Strings from byte arrays, and when getting byte arrays from Strings.

If not, the default charset of your system will be used, potentially creating problems everywhere...

Instead of

new String(myByteArray);
//... and...
myString.getBytes();

you should use

new String(myByteArray, "UTF-8");
//... and...
myString.getBytes("UTF-8");

For example, when serializing a DOM document to UTF-8 bytes:

Transformer transformer = TransformerFactory.newInstance().newTransformer();

// Tell the serializer explicitly which encoding to use
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");

StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(xmlDocument); // xmlDocument is the org.w3c.dom.Document
transformer.transform(source, result);

// The StringWriter holds characters; encode them to UTF-8 bytes explicitly
return result.getWriter().toString().getBytes("UTF-8");

Since Java 1.6, you can specify a Charset instead of a String containing the charset name:
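For example (a small sketch; the Charset overloads also have the advantage of not throwing the checked UnsupportedEncodingException):

Charset utf8 = Charset.forName("UTF-8");

new String(myByteArray, utf8);
myString.getBytes(utf8);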

Andrea Ligios