
Some byte arrays decoded with new String(byte[], "UTF-8") return different results in JDK 1.7 and 1.8

byte[] bytes1 = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
String str1 = new String(bytes1, "UTF-8");
System.out.println(str1.length());

byte[] out1 = str1.getBytes("UTF-8");
System.out.println(out1.length);
System.out.println(Arrays.toString(out1));

byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};
String str2 = new String(bytes2, "UTF-8");
System.out.println(str2.length());

byte[] out2 = str2.getBytes("UTF-8");
System.out.println(out2.length);
System.out.println(Arrays.toString(out2));

For bytes2, the result of new String(byte[], "UTF-8") (str2) is not the same in JDK 7 and JDK 8, but for bytes1 it is the same. What is special about bytes2?

When decoding with "ISO-8859-1" instead, the result for bytes2 is the same in JDK 1.8 as in JDK 1.7!
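That is expected: ISO-8859-1 maps every byte value 0x00–0xFF to exactly one character, so decoding can never fail and never produces replacement characters. A small demo of that round-trip property (my addition, not part of the original question):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};
        // ISO-8859-1 maps each byte to the code point with the same value,
        // so decoding is lossless and identical on every JDK version.
        String s = new String(bytes2, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(s.length());                  // 16 - one char per byte
        System.out.println(Arrays.equals(bytes2, back)); // true - exact round-trip
    }
}
```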

jdk1.7.0_80:

15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
15
31
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67]

jdk1.8.0_201:

15
27
[55, 93, 97, -17, -65, -67, 4, 8, 29, 26, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, -17, -65, -67, 88]
16
34
[65, -17, -65, -67, -17, -65, -67, 73, 32, 68, 49, 73, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 11, -17, -65, -67, -17, -65, -67]
sapeming
    Probably related: https://stackoverflow.com/questions/30575509/java-8-change-in-utf-8-decoding and better: https://stackoverflow.com/questions/25404373/java-8-utf-8-encoding-issue-java-bug – assylias May 07 '19 at 08:03
  • When executing this in Java 10 you get the same result as in java 8. – dan1st May 07 '19 at 08:43

1 Answer


Short answer:

In the second byte array, the last 2 bytes [-32, -30] (0b11100000_11100010) are decoded as:

By JDK 7: a single 0xFFFD character (Unicode's "replacement character"), which re-encodes as [-17, -65, -67];
By JDK 8: two 0xFFFD characters, which re-encode as [-17, -65, -67, -17, -65, -67].

Long answer:

Some byte sequences in your arrays are not valid UTF-8. Let's consider this code:

byte[] bb = {55, 93, 97, -13, 4, 8, 29, 26, -68, -4, -26, -94, -37, 32, -41, 88};
for (byte b : bb) System.out.println(Integer.toBinaryString(b & 0xff));

It will print (I added leading underscores manually for readability):

__110111
_1011101
_1100001
11110011
_____100
____1000
___11101
___11010
10111100
11111100
11100110
10100010
11011011
__100000
11010111
_1011000

As you can read in the UTF-8 Wikipedia article, a UTF-8 encoded string uses the following binary sequences:

0xxxxxxx -- for ASCII characters
110xxxxx 10xxxxxx -- for 0x0080 to 0x07ff
1110xxxx 10xxxxxx 10xxxxxx -- for 0x0800 to 0xFFFF
... and so on
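You can apply this scheme mechanically. Here is a small sketch (my addition, not from the original answer) that labels each byte of the second array by its leading bits:

```java
public class ClassifyBytes {
    // Classify an unsigned byte value (0-255) by its UTF-8 leading-bit pattern.
    static String classify(int u) {
        if (u < 0x80) return "ASCII (0xxxxxxx)";
        if (u < 0xC0) return "continuation (10xxxxxx)";
        if (u < 0xE0) return "2-byte lead (110xxxxx)";
        if (u < 0xF0) return "3-byte lead (1110xxxx)";
        if (u < 0xF8) return "4-byte lead (11110xxx)";
        return "invalid (11111xxx)"; // 0xF8-0xFF never appear in valid UTF-8
    }

    public static void main(String[] args) {
        byte[] bytes2 = {65, -103, -103, 73, 32, 68, 49, 73, -1, -30, -1, -103, -92, 11, -32, -30};
        for (byte b : bytes2) {
            int u = b & 0xff;
            System.out.printf("%8s %s%n", Integer.toBinaryString(u), classify(u));
        }
    }
}
```

Running it shows, for example, stray continuation bytes with no preceding lead byte, which is exactly what forces the decoder to emit replacement characters.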

So each byte sequence that doesn't follow this encoding scheme is replaced by 3 bytes:

[-17, -65, -67]
In binary: 11101111 10111111 10111101
Payload bits: 0b11111111_11111101
Unicode hex: 0xFFFD (Unicode's "replacement character")
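You can verify that byte triple yourself (a quick check of my own, not the original author's):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementChar {
    public static void main(String[] args) {
        // U+FFFD encodes in UTF-8 as EF BF BD, i.e. the signed
        // bytes -17, -65, -67 that appear in the outputs above.
        byte[] enc = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(enc)); // [-17, -65, -67]
    }
}
```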

The only difference between the arrays printed by your code is how one sequence is processed: the 2 bytes at the end of your second array:

[-32, -30] is 0b11100000_11100010, and this is not valid UTF-8: 0xE0 announces a 3-byte sequence, but the next byte 0xE2 is not a continuation byte (10xxxxxx).

JDK 7 generated single 0xFFFD character for this sequence.
JDK 8 generated two 0xFFFD characters for this sequence.

The RFC 3629 standard has no clear instructions on how to handle invalid sequences, so it seems that in JDK 8 they decided to generate one 0xFFFD per invalid byte, which seems to be more correct.
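On a current JDK you can observe the per-byte replacement directly; a small check (my addition, assuming you run it on JDK 8 or later):

```java
import java.nio.charset.StandardCharsets;

public class MalformedTail {
    public static void main(String[] args) {
        // 0xE0 starts a 3-byte sequence, but 0xE2 is not a continuation byte,
        // so the decoder reports two separate malformed sequences.
        byte[] tail = {-32, -30};
        String s = new String(tail, StandardCharsets.UTF_8);
        System.out.println(s.length());        // 2 on JDK 8+ (1 on JDK 7)
        System.out.println((int) s.charAt(0)); // 65533 = 0xFFFD
    }
}
```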

The other question is why you are trying to parse such raw, non-UTF-8 bytes as UTF-8 characters at all, when you should not be doing that.

semplar