java utf-8 encoding bytes changes in string for odd number of characters

Question

I have the following code:

public static void main(String args[]) throws UnsupportedEncodingException {
    System.setProperty("file.encoding", "gbk");

    String name = "こんにちわ";
    String copy = new String(name.getBytes("utf-8"));

    byte[] b1 = name.getBytes("utf-8");
    byte[] b2 = copy.getBytes();

    System.out.println("b1: " + Arrays.toString(b1));
    System.out.println("b2: " + Arrays.toString(b2));
}

The console output is:

b1: [-29, -127, -109, -29, -126, -109, -29, -127, -85, -29, -127, -95, -29, -126, -113]
b2: [-29, -127, -109, -29, -126, -109, -29, -127, -85, -29, -127, -95, -29, -126, 63]

Note the last byte is different in the new String.

Now, if I use the input String name = "こんにち"; (just 4 Japanese Characters) instead, it changes to:

b1: [-29, -127, -109, -29, -126, -109, -29, -127, -85, -29, -127, -95]
b2: [-29, -127, -109, -29, -126, -109, -29, -127, -85, -29, -127, -95]

This time the bytes are exactly the same.

I use JDK 1.6.0_45 on Windows. The default charset is GBK. Have I hit some encoding limitation?

What is the default charset in your setup? — RealSkeptic, Jul 06 '15 at 11:27
I get `"-113"` on both rows, using 1.6.0_45 on Windows. — Keppil, Jul 06 '15 at 11:30
@RealSkeptic The default charset is GBK. — oyss, Jul 06 '15 at 11:32

score 2 · Accepted Answer · answered Jul 06 '15 at 11:56

Basically, the first four lines of your program are equivalent to:

    String name = "こんにちわ";
    byte[] b1 = name.getBytes("utf-8");

    String a = new String( name.getBytes("utf-8"), "gbk" );
    byte[] b2 = a.getBytes("gbk");

That is, you are taking a byte array (b1) that is the UTF-8 representation of your Japanese string, and telling Java "this byte array is in GBK encoding, convert it into a text".

This will not work. If you print the string a, you'll see that it does not contain Japanese text but rather some Chinese gibberish, plus the replacement character ("�") at the end.

Internally, Java strings are encoded in UTF-16. But when you convert to and from byte arrays, you have to specify the encoding. Encodings are different from one another, and may use the same byte value or sequence of byte values to represent a totally different character.

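In this case, there are byte sequences in UTF-8 that are not legal in GBK, so Java substitutes the replacement character for them. GBK reads every byte in the range 0x81 to 0xFE as the first half of a two-byte character, and each of your hiragana is three such bytes in UTF-8. Four characters give 12 bytes, which happen to pair up into six valid GBK characters and survive the round trip. Five characters give 15 bytes, so the last byte (-113) has no partner and is decoded as the replacement character. GBK cannot encode that character, so getBytes() writes ? (63) in its place.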
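The effect can be reproduced without depending on the machine's default charset by naming GBK explicitly. The following is a minimal sketch, not the asker's exact program: the class name WrongCharsetDemo and the helper decodeUtf8BytesAs are made up for illustration, the string is written with Unicode escapes so the source file encoding does not matter, and it assumes the JVM ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class WrongCharsetDemo {
    // The five hiragana from the question, written as Unicode escapes.
    static final String NAME = "\u3053\u3093\u306b\u3061\u308f";

    // Encode as UTF-8, then (incorrectly) decode those bytes with another charset.
    static String decodeUtf8BytesAs(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // Assumption: GBK is available in this JVM (it is an optional charset).
        Charset gbk = Charset.forName("GBK");

        byte[] b1 = NAME.getBytes(StandardCharsets.UTF_8);
        String copy = decodeUtf8BytesAs(NAME, gbk);
        byte[] b2 = copy.getBytes(gbk);

        System.out.println("UTF-8 byte count: " + b1.length);
        System.out.println("copy equals name: " + copy.equals(NAME));
        System.out.println("copy contains U+FFFD: " + (copy.indexOf('\uFFFD') >= 0));
        System.out.println("bytes survived the round trip: " + Arrays.equals(b1, b2));
    }
}
```

With an odd number of UTF-8 bytes at least one byte cannot be paired, so the decoded string contains U+FFFD and the re-encoded bytes cannot match the originals.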

If you want to create a new string from b1 and have it still be こんにちわ, you need to create it while telling Java that the bytes are in UTF-8.

    String a = new String( name.getBytes("utf-8"), "utf-8" );

Then, your a will be equal to name.

Then, if you just call a.getBytes(), you'll get the bytes that represent that string in GBK. They will differ from b1 because they are in a different encoding. To get the same array, you need to use the same encoding (a.getBytes("utf-8")).
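As a sketch of that round trip, the example below passes the charset explicitly on both the decoding and the encoding side. The class name ExplicitCharsetRoundTrip and the helper roundTrip are hypothetical, and StandardCharsets requires Java 7 or later; on Java 6, Charset.forName("UTF-8") or the "utf-8" string overloads would play the same role. GBK is again assumed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ExplicitCharsetRoundTrip {
    // The five hiragana from the question, written as Unicode escapes.
    static final String NAME = "\u3053\u3093\u306b\u3061\u308f";

    // Decode and re-encode with the same, explicitly named charset.
    static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] b1 = NAME.getBytes(StandardCharsets.UTF_8);
        String a = new String(b1, StandardCharsets.UTF_8);

        System.out.println("a equals name: " + a.equals(NAME));
        System.out.println("same UTF-8 bytes: "
                + Arrays.equals(b1, a.getBytes(StandardCharsets.UTF_8)));

        // The same text in GBK is a different, shorter byte array.
        byte[] asGbk = a.getBytes(Charset.forName("GBK"));
        System.out.println("GBK byte count: " + asGbk.length);
    }
}
```

If the bytes feed a hash, as in the comment below, the point is the same: hash the output of getBytes with one fixed charset everywhere, never the no-argument form.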

Try not to rely on your Java's default character set. Always specify the exact character set when you get bytes from a string, and when you convert bytes into a string.
Different character sets produce different byte arrays for the same string.
getBytes() and String(byte[]) without a charset parameter do not give you the real byte sequence that underlies the String. They use the JVM's default character set, which in your case is GBK.
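To see which charset those no-argument calls use on a given machine, Charset.defaultCharset() can be printed and compared. This is a small sketch with a made-up class name (DefaultCharsetCheck) and helper (usesDefaultCharset); the output of the first line depends on the platform and JVM settings, so none is shown here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultCharsetCheck {
    // True when the no-argument getBytes() matches encoding with the JVM default charset.
    static boolean usesDefaultCharset(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Platform dependent: GBK on the asker's Windows setup, often UTF-8 elsewhere.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("getBytes() uses it: "
                + usesDefaultCharset("\u3053\u3093\u306b\u3061\u308f"));
    }
}
```

Because the result of the first line varies between machines, code that needs stable bytes should not depend on it.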

I tried new String( name.getBytes("utf-8"), "utf-8" ), but it gives me bytes like this: "-92""-77""-92""-13""-92""-53""-92""-17""-92""-17", completely different from "こんにわわ".getBytes("utf-8"). These bytes are used for hashing, so they need to be exactly the same to produce the same hash. — oyss, Jul 06 '15 at 12:14
Did you see that I said " To get the same array, you need to use the same encoding (`a.getBytes("utf-8")`)"? — RealSkeptic, Jul 06 '15 at 12:24

score 1 · Answer 2 · edited May 23 '17 at 11:58

You are facing a common problem: you are using the default platform encoding on a sequence of bytes that was encoded differently. This line

    byte[] b1 = name.getBytes("utf-8");

converts your string to a byte[] using the UTF-8 encoding. This line:

    String a = new String( name.getBytes("utf-8"));

creates a string from a byte array, but without specifying the charset. This can be a problem because the JVM "picks its own value". Note that the String class also has the following constructor:

    String(byte[] bytes, Charset charset)

which lets you create a string from a sequence of bytes using the encoding you pass as the second parameter.

The line String a = new String( name.getBytes("utf-8")); uses the default platform encoding which, according to your comment, is GBK. So what you are really doing is:

    String a = new String( name.getBytes("utf-8"),"gbk");

instead of

    String a = new String( name.getBytes("utf-8"),"UTF-8");

The "tricky" part is that some encodings overlap, i.e. they encode some (but not all) symbols with the same sequence of bytes, so they represent some strings in the same way and others differently. ISO8859-1, for example, represents characters in the same way as ISO8859-15 except for the € and a few other characters (ISO8859-15 was introduced to provide a single-byte encoding with the € sign), so the byte sequence representing a string is the same in ISO8859-1 and ISO8859-15 as long as the string contains none of those differing characters.
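A short sketch of that overlap, assuming the JVM provides the ISO-8859-15 charset (the class name OverlappingEncodings and the helper sameBytes are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class OverlappingEncodings {
    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
    // Assumption: ISO-8859-15 is available in this JVM.
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // True when both charsets encode the text to identical bytes.
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(LATIN1), text.getBytes(LATIN9));
    }

    public static void main(String[] args) {
        // Plain ASCII and most accented letters sit at the same positions in both.
        System.out.println("caf\\u00e9 identical: " + sameBytes("caf\u00e9"));
        // The euro sign exists only in ISO-8859-15; ISO-8859-1 falls back to '?'.
        System.out.println("euro identical: " + sameBytes("\u20ac"));
        System.out.println("euro in ISO-8859-15: "
                + Arrays.toString("\u20ac".getBytes(LATIN9)));
        System.out.println("euro in ISO-8859-1: "
                + Arrays.toString("\u20ac".getBytes(LATIN1)));
    }
}
```

A test string that happens to avoid the differing characters hides the mismatch, just as the four-character input hid it in the question.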

If you would like to read more about Java encoding, including in combination with XML, you can have a look at this one

score 1 · Answer 3 · answered Jul 06 '15 at 12:02

No, you did not hit an encoding limitation, but your code uses the default charset when it creates the variables a and b2.

Try this:

    String a = new String(name.getBytes("UTF-8"),"UTF-8");
    byte[] b2 = a.getBytes("UTF-8");

maloku
