
Consider the following:

public static void main(String... strings) throws Exception {
    byte[] b = { -30, -128, -94 };

    //section utf-32
    String string1 = new String(b,"UTF-32");
    System.out.println(string1);   //prints ?
    printBytes(string1.getBytes("UTF-32")); //prints 0 0 -1 -3 
    printBytes(string1.getBytes());  //prints 63

    //section utf-8
    String string2 = new String(b,"UTF-8"); 
    System.out.println(string2);  // prints •
    printBytes(string2.getBytes("UTF-8"));  //prints -30 -128 -94 
    printBytes(string2.getBytes());  //prints -107 
}

public static void printBytes(byte[] bytes){
    for(byte b : bytes){
        System.out.print(b +  " " );
    }

    System.out.println();
}

output:

?
0 0 -1 -3 
63 
•
-30 -128 -94 
-107 

So I have two questions:

  1. In both sections: why are the outputs of getBytes() and getBytes(charset) different, even though I have specifically mentioned the string's charset?
  2. Why are both byte outputs of getBytes in the utf-32 section different from the original byte[] b? (i.e. how can I convert a string back to its original byte array?)
nafas

2 Answers


Question 1:

In both sections: why are the outputs of getBytes() and getBytes(charset) different, even though I have specifically mentioned the string's charset?

The character set you've specified is only used while encoding the string's characters into the byte array, i.e. inside that method call itself. It is not part of the String instance: you are not setting the character set for the string, and the character set is not stored.

A Java String does not keep an internal byte encoding; internally it uses an array of char (UTF-16). If you call String.getBytes() without specifying a character set, it uses the platform default - e.g. Windows-1252 on Windows machines.
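For example, a minimal sketch (the bullet character U+2022 is the one your bytes decode to) showing that only the no-argument getBytes() depends on the platform:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    public static void main(String[] args) {
        String s = "\u2022"; // BULLET, the character your UTF-8 bytes decode to

        // Explicit charset: always the UTF-8 encoding -30 -128 -94
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        // No charset: platform default, e.g. -107 on Windows-1252,
        // or the same UTF-8 bytes on a machine whose default is UTF-8
        byte[] platform = s.getBytes();

        System.out.println(Arrays.toString(utf8));
        System.out.println(Arrays.toString(platform));
    }
}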


Question 2:

Why are both byte outputs of getBytes in the utf-32 section different from the original byte[] b? (i.e. how can I convert a string back to its original byte array?)

You cannot always do this. Not all byte sequences are a valid encoding of characters. When such an array is decoded, the malformed parts are silently replaced with the replacement character U+FFFD, so the original bytes are lost. That is where string1's 0 0 -1 -3 comes from: it is the UTF-32 encoding of U+FFFD.

This already happens during String string1 = new String(b,"UTF-32"); and String string2 = new String(b,"UTF-8");.

You can change this behavior using a CharsetDecoder instance, obtained via Charset.newDecoder(), and configure what should happen on malformed input.
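For example, a minimal sketch (using the same three bytes as in the question) that makes the decoder report the problem instead of quietly replacing it:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecode {
    public static void main(String[] args) throws CharacterCodingException {
        byte[] b = { -30, -128, -94 };

        // REPORT makes decoding fail loudly instead of inserting U+FFFD
        CharsetDecoder decoder = Charset.forName("UTF-32").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // Three bytes are not a complete UTF-32 code unit, so this throws
        // a MalformedInputException instead of silently producing U+FFFD.
        String s = decoder.decode(ByteBuffer.wrap(b)).toString();
        System.out.println(s);
    }
}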


If you want to encode an arbitrary byte array into a String instance then you should use a hexadecimal or Base64 encoder. You should not use a character decoder for that.
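For example, a minimal round trip with java.util.Base64 (available since Java 8):

import java.util.Arrays;
import java.util.Base64;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] original = { -30, -128, -94 };

        // Encode the raw bytes as plain ASCII text...
        String text = Base64.getEncoder().encodeToString(original);

        // ...and decode them back; the byte array survives unchanged.
        byte[] restored = Base64.getDecoder().decode(text);

        System.out.println(text);                              // 4oCi
        System.out.println(Arrays.equals(original, restored)); // true
    }
}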

Maarten Bodewes

Java String / char (16-bit UTF-16!) / Reader / Writer are for Unicode text, so all scripts may be combined in one text.

Java byte (8 bits) / InputStream / OutputStream are for binary data. If that data represents text, one needs to know its encoding to make text out of it.

So a conversion from bytes to text always needs a Charset. Often there is an overloaded method without the charset parameter, which then defaults to System.getProperty("file.encoding") - and that can differ on every platform. Relying on the default is non-portable if the data is cross-platform.
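For example, a minimal sketch (class name is just illustrative) of bridging binary data to text with an explicit charset instead of relying on file.encoding:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BytesToText {
    public static void main(String[] args) throws IOException {
        byte[] data = { -30, -128, -94 }; // UTF-8 bytes of the bullet character

        // InputStreamReader bridges the binary world (InputStream)
        // to the text world (Reader) using the charset you name.
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                System.out.printf("U+%04X%n", c); // prints U+2022
            }
        }
    }
}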

So you had the misconception that the encoding belonged to the String. That is understandable, seeing that in C/C++ an unsigned char and a byte were largely interchangeable and encodings were a nightmare.

Joop Eggen
  • UTF-16 encodes Unicode, and in the mentioned case needs a "surrogate" pair of chars to encode one code point. The encoding is as safe as UTF-8, marking the high bits. Hence `int String.codePointAt(int i)` and `Character.charCount(int cp)`. You are right to see text in Java as amorphous "Unicode". The moment you need bytes, UTF-32 might be an option. Casting a UTF-8 byte or a UTF-16 char will only be partially correct, and is a no-no. – Joop Eggen Jul 24 '15 at 14:37
  • I thought so too, but [here](http://stackoverflow.com/questions/31613779/is-a-java-char-array-always-a-valid-utf-16-big-endian-encoding) it is mused that that may not be the case and I was wrong. `char` arrays seem to always be UTF-16BE *as long as the underlying string is*. Note the person asking the question :) – Maarten Bodewes Jul 31 '15 at 08:54
  • @MaartenBodewes Yes, it was a visionary, laudable design decision at that time to let `String` be Unicode text, `char` be 2 bytes, and binary data use `byte[]`. Up to today other languages sometimes still have their small troubles with this. – Joop Eggen Jul 31 '15 at 09:08