log.info(new String(new byte[]{-7, 'a'}, "UTF-8").length());

On Oracle Java 1.8.0_60 this prints "2"; on 1.7.0_79 it prints "1".

Why?

user1050755
  • http://stackoverflow.com/questions/25404373/java-8-utf-8-encoding-issue-java-bug – user1050755 Sep 10 '15 at 22:15
  • What is actually the meaning of that `-7` element? That byte value is not used in UTF-8. – roeland Sep 11 '15 at 04:47
  • @roeland: Java does not have unsigned data types. A `byte` value of -7 is 0xF9 in hex, and `'a'` is 0x61. However, `0xF9 0x61` is not a valid UTF-8 byte sequence. So, it appears that 1.8.0_60 is interpreting the 0xF9 byte individually as an illegal byte and thus decoding the two bytes into a `'?a'` string, whereas 1.7.0_79 is interpreting the bytes together as an illegal sequence and decoding them to a `'?'` string. `0xF9` is the start byte of a 5-byte UTF-8 sequence, which is illegal in the modern UTF-8 standard, and is not implemented by Java. – Remy Lebeau Sep 12 '15 at 04:09
  • @roeland: this is actually a bug in Java 7 that was fixed in Java 8. See [Java 8 UTF-8 encoding issue (java bug?)](http://stackoverflow.com/questions/25404373/), and maybe also [JDK-8039751 UTF-8 decoder fails to handle some edge cases correctly](https://bugs.openjdk.java.net/browse/JDK-8039751) – Remy Lebeau Sep 12 '15 at 04:17

1 Answer


You're passing in an invalid UTF-8 sequence. From the docs for the `String(byte[], String)` constructor:

> The behavior of this constructor when the given bytes are not valid in the given charset is unspecified.

So a conforming implementation could just as well return "Hello world!" for this input; neither "1" nor "2" is the guaranteed result.
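If you need the same outcome on every Java version, one option is to decode with a `CharsetDecoder` configured to reject malformed input instead of silently substituting replacement characters. The following is only a minimal sketch: the class name `StrictUtf8` and the helper `decodeStrict` are made up for illustration, and wrapping the failure in an `IllegalArgumentException` is an assumption about how you would want to surface the error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {

    // Hypothetical helper: decodes the bytes as UTF-8 and fails loudly
    // on malformed input rather than inserting replacement characters.
    static String decodeStrict(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Assumption: an unchecked exception is acceptable to the caller.
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] input = {-7, 'a'}; // 0xF9 0x61, the bytes from the question
        try {
            System.out.println(decodeStrict(input).length());
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With this approach the bytes from the question are rejected outright, so the result no longer depends on how a particular JDK chooses to substitute for the invalid sequence.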

Holger
roeland