Java Strings Character Encoding - For French - Dutch Locales

Question

I have the following piece of code

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public static void main(String[] args) throws UnsupportedEncodingException {
    System.out.println(Charset.defaultCharset().toString());

    String accentedE = "é";

    String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes("utf-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes());
    System.out.println(utf8);
}

The output of the above is as follows

windows-1252
é
?
Ã©
é

Can someone help me understand what this code does and why it produces this output?

To get the expected output, make sure that the file's encoding is set to UTF-8. If you are using Eclipse, right-click the file, select Properties, and choose UTF-8 as the text file encoding. — user964147, Mar 19 '13 at 13:21

Esailija · Answer 1 · 2013-03-20T16:39:32.930

If you already have a String, there is no need to encode it and decode it right back; the string is already the result of someone having decoded raw bytes.

In the case of a string literal, the someone is the compiler reading your source as raw bytes and decoding it in the encoding you have specified to it. If you have physically saved your source file in Windows-1252 encoding, and the compiler decodes it as Windows-1252, all is well. If not, you need to fix this by declaring the correct encoding for the compiler to use when compiling your source...
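A minimal sketch of one way to take the source-file encoding out of the picture while experimenting: write the literal as a Unicode escape, which the compiler reads the same way whatever encoding the file was saved in. The class name LiteralCheck is made up for illustration. With javac, the source encoding is otherwise declared through the -encoding option.

```java
class LiteralCheck {
    // Unicode escape for e-acute (U+00E9). The escape is plain ASCII, so the
    // compiler decodes it the same way under any source-file encoding.
    static final String ACCENTED_E = "\u00e9";

    public static void main(String[] args) {
        // One char with code point 0xE9, regardless of how the file was saved.
        System.out.println(ACCENTED_E.length());
        System.out.println(Integer.toHexString(ACCENTED_E.codePointAt(0)));
    }
}
```

If the literal "é" in your own file does not compare equal to this constant, the compiler is probably decoding the file with a different encoding than the one it was saved in.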

The line

String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));

Does absolutely nothing. (Encode as UTF-8, Decode as UTF-8 == no-op)

The line

utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));

Encodes the string as Windows-1252, and then decodes it as UTF-8. The bytes must only be decoded as Windows-1252 (because they were encoded in Windows-1252, duh); otherwise you will get strange results.

The line

utf8 = new String(accentedE.getBytes("utf-8"));

Encodes the string as UTF-8, and then decodes it as Windows-1252. The same principle applies as in the previous case.

The line

utf8 = new String(accentedE.getBytes());

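Does absolutely nothing. (Encode as Windows-1252, Decode as Windows-1252 == no-op) This holds for "é" because Windows-1252 can represent it; a character the charset cannot represent would come back as "?".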
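The four cases can be reproduced without depending on the platform default by naming both charsets explicitly. This is only a sketch: the class RoundTrips and the helper recode are hypothetical names, and it assumes your JDK provides the windows-1252 charset. It prints booleans rather than the strings themselves so that the console encoding cannot distort the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrips {
    // Assumption: the JDK in use ships the windows-1252 charset.
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset UTF8 = StandardCharsets.UTF_8;

    // Encode with one charset, then decode the resulting bytes with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // e-acute, written as an escape

        // UTF-8 then UTF-8: no-op.
        System.out.println(recode(e, UTF8, UTF8).equals(e));
        // windows-1252 then UTF-8: the lone byte E9 is malformed UTF-8,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(recode(e, CP1252, UTF8).equals("\uFFFD"));
        // UTF-8 then windows-1252: bytes C3 A9 become two separate characters.
        System.out.println(recode(e, UTF8, CP1252).equals("\u00c3\u00a9"));
        // windows-1252 then windows-1252: no-op for this character.
        System.out.println(recode(e, CP1252, CP1252).equals(e));
    }
}
```

All four lines should print true.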

Analogy with integers that might be easier to understand:

int a = 555;
// The case of encoding as X and decoding right back as X
a = Integer.parseInt(String.valueOf(a), 10);
// a is still 555

int b = 555;
// The case of encoding as X and decoding right back as Y
b = Integer.parseInt(String.valueOf(b), 15);
// b is now 1205, i.e. a strange result

Both of these are useless because we already had what we needed before running any of this code: the integer 555.

There is a need for encoding your string into raw bytes when it leaves your system and there is a need for decoding raw bytes into a string when they come into your system. There is no need to encode and decode right back within the system.
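As a rough illustration of that boundary, here is a sketch where an in-memory byte array stands in for a file or socket. The class Boundary and the methods send and receive are hypothetical names; the point is only that the charset is named explicitly on both sides and nothing is re-encoded in between.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Boundary {
    // Leaving the system: characters become bytes, with the charset named explicitly.
    static byte[] send(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    // Entering the system: bytes become characters, decoded with the same charset.
    static String receive(byte[] wire) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(wire), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] wire = send("caf\u00e9");
        System.out.println(wire.length);                       // 5: e-acute takes two bytes in UTF-8
        System.out.println(receive(wire).equals("caf\u00e9")); // true
    }
}
```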

"(Encode as Windows-1252, Decode as Windows-1252 == no-op)" – not true. It will mangle all the characters that are not available in Windows 1252 and turn them into question marks. — Karol S, Sep 06 '14 at 21:11
@KarolS you are taking that line out of context to sound smart lol — Esailija, Sep 09 '14 at 08:51

Stephen C · Answer 2 · 2013-03-19T13:46:05.037

Line #1 - the default character set on your system is windows-1252.

Line #2 - you created a String by encoding a String literal to UTF-8 bytes, and then decoding it using the UTF-8 scheme. The result is a correctly formed String, which can be output correctly using windows-1252 encoding.

Line #3 - you created a String by encoding a string literal as windows-1252, and then decoding it using UTF-8. The UTF-8 decoder has detected a sequence that cannot possibly be UTF-8, and has replaced the offending byte with a question mark "?" (strictly, with the Unicode replacement character U+FFFD, which is printed as "?" when the output encoding is windows-1252). (In UTF-8, any byte that has the top bit set to 1 is part of a multi-byte sequence. But the windows-1252 encoding of "é" is just one byte long, ergo this is bad UTF-8.)

Line #4 - you created a String by encoding in UTF-8 and then decoding in windows-1252. In this case the decoding has not "failed", but it has produced garbage (aka mojibake). The reason you got 2 characters of output is that the UTF-8 encoding of "é" is a 2 byte sequence.

Line #5 - you created a String by encoding as windows-1252 and decoding as windows-1252. This produces the correct output.

And the overall lesson is that if you encode characters to bytes with one character encoding, and then decode with a different character encoding you are liable to get mangling of one form or another.
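To see where the "?" on line #3 comes from, the sketch below performs the mismatched decode and then encodes the result the way a windows-1252 console would. QuestionMark and misdecoded are made-up names, and the code assumes the windows-1252 charset is available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuestionMark {
    // Assumption: the JDK in use ships the windows-1252 charset.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode the single windows-1252 byte for e-acute (0xE9) as if it were UTF-8.
    static String misdecoded() {
        byte[] cp1252Bytes = "\u00e9".getBytes(CP1252);
        return new String(cp1252Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = misdecoded();
        // The decoder substituted the replacement character for the bad byte.
        System.out.println(Integer.toHexString(s.charAt(0))); // fffd

        // U+FFFD has no windows-1252 mapping, so encoding it for output
        // substitutes the charset's replacement byte, '?' (0x3F).
        byte[] printed = s.getBytes(CP1252);
        System.out.println((char) printed[0]); // ?
    }
}
```

So the original information is already lost at the decoding step; the question mark is just how the placeholder ends up being displayed.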

linski · Answer 3 · 2013-03-19T21:42:11.720

When you call the String getBytes() method, it:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

So whenever you do:

accentedE.getBytes()

it returns the contents of the accentedE String as bytes encoded in the platform's default charset, in your case cp-1252.

This line:

new String(accentedE.getBytes(), Charset.forName("UTF-8"))

takes the accentedE bytes (encoded in cp1252) and tries to decode them in UTF-8, hence the error. The same situation from the other side for:

new String(accentedE.getBytes("utf-8"))

Here getBytes("utf-8") encodes accentedE as UTF-8 bytes, but then the String constructor decodes them with the default OS code page, which is cp-1252. As the documentation of that constructor says, it:

Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.

I strongly recommend reading this excellent article:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

UPDATE:

In short, every character is stored as a number. To know which number stands for which character, the OS uses code pages. Consider the following snippet:

String accentedE = "é";

System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));

which outputs:

C3 
A9 
E9

That is because the lowercase accented e is stored in UTF-8 as the two bytes C3 A9, while in cp-1252 it is stored as the single byte E9. For a detailed explanation, read the linked article.
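If you want to inspect whole strings rather than individual bytes, a small helper along these lines may be handy. HexDump and hex are hypothetical names, and the windows-1252 lookup assumes that charset is available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexDump {
    // Render the encoded form of a string as space-separated hex bytes.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so the byte is formatted as an unsigned value.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00e9"; // e-acute, written as an escape
        System.out.println(hex(e, StandardCharsets.UTF_8));          // C3 A9
        System.out.println(hex(e, Charset.forName("windows-1252"))); // E9
    }
}
```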
