
We recently migrated our application from JDK 7 to JDK 8. After the change, we ran into a problem with the following snippet of code:

String output = new String(byteArray, "UTF-8");

The byte array may contain invalid UTF-8 byte sequences. The same byte array, upon UTF-8 decoding, results in two different strings on Java 7 and Java 8.

According to the answer to this SO post, Java 8 "fixes" an error in Java 7 and replaces invalid UTF-8 byte sequences with the replacement character, which is in accordance with the UTF-8 specification.

But we would like to stick with Java 7's version of the decoded string.

We have tried using a `CharsetDecoder` with `CodingErrorAction` set to REPLACE, REPORT and IGNORE on Java 8. Still, we were not able to generate the same string as Java 7.
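For reference, the decoder setup we tried looked roughly like this (a sketch; the helper name `decodeUtf8` is just for illustration):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Decode byteArray as UTF-8 with an explicit error action
// (we tried REPLACE, REPORT and IGNORE in turn).
static String decodeUtf8(byte[] byteArray, CodingErrorAction action)
        throws CharacterCodingException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(action)
            .onUnmappableCharacter(action);
    // decode() throws only when the action is REPORT.
    return decoder.decode(ByteBuffer.wrap(byteArray)).toString();
}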

Can we do this with a technique of reasonable complexity?

  • Please post the exact input `byteArray` (a minimal excerpt from it), so we can reproduce your problem. – Tagir Valeev Jun 01 '15 at 14:30
  • If your issue is indeed that there are wrongly encoded surrogate pairs, `CodingErrorAction` won’t help you. Think of `UTF-8` and `modified UTF-8` as being two entirely different encodings. In that case you wouldn’t expect an error recovery option switching to another encoding, would you? So what you need then is an alternative `Charset` implementation, but that wouldn’t be simpler than the five lines of the linked answer. – Holger Jun 01 '15 at 14:49
  • @Holger I am not sure that there are "only" wrongly encoded surrogate pairs; the code actually does something like this: `new String(hmac.doFinal(byteArray), "UTF-8")`, where `hmac` is an instance of `Mac`. This is by no means a valid UTF-8 encoded string (not even parts of it). If we follow the solution you mentioned in the other post, we get an exception for invalid UTF-8 characters. – Jiraiya Jun 01 '15 at 15:59
  • So you tell it to misinterpret arbitrary data as `UTF-8`? What is the purpose of this? – Holger Jun 01 '15 at 16:19
  • Legacy! :) I guess someone wanted to convert binary input to character input, and obviously they did it the wrong way. The ideal way would be to Base64-encode the digest. Fixing it the right way is, sadly, very expensive. Hence I was trying to emulate Java 7's behavior. I am beginning to suspect that the discrepancy may be due to the change in the Unicode version between Java 7 and Java 8. – Jiraiya Jun 01 '15 at 17:12
  • The problem is that if you put in arbitrary content, almost any byte in the range `0x80-0xff` will form an invalid character and produce the replacement character, even under Java 7. The difference lies only in the few cases where the bytes happened to form a surrogate character (by pure accident), but the solution in the linked question only works for valid sequences (valid regarding modified UTF-8) as it flags errors via Exception rather than producing replacement characters. Oh, and `\0` is handled differently. – Holger Jun 01 '15 at 17:35
  • I was wondering if it is okay to manually replace invalid bytes, other than the ones falling between U+D800 and U+DFFF inclusive, with the replacement string, and then use your method to read them as UTF via a `DataInput` implementation. Your thoughts on that? – Jiraiya Jun 01 '15 at 18:14
  • That would require knowledge of the (modified) UTF-8 format as well as of how *both* the Java 7 `CharsetDecoder` and the `DataInputStream` handle errors. I’m not even sure whether it’s possible to get `DataInputStream` to produce the same behavior. If you are that deep into the matter, you are more than halfway to implementing your own decoder. – Holger Jun 01 '15 at 18:34
  • If the goal is solely to preserve the binary data in a `String`, you can use CP437. All byte values and sequences are valid in CP437. If you want the exact string that Java 7 erroneously produced, then you'll have to invent and implement such a `CharsetDecoder`. – Tom Blodget Jun 01 '15 at 22:48
  • http://stackoverflow.com/questions/25404373/java-8-utf-8-encoding-issue-java-bug – user1050755 Sep 11 '15 at 08:07

1 Answer


From the pointers provided by @Holger, it was clear that we had to write a custom `CharsetDecoder`.

I copied over OpenJDK's version of the `sun.nio.cs.UTF_8` class, renamed it to `CustomUTF_8`, and used it to construct the string like so:

String output = new String(bytes, new CustomUTF_8());
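
In outline, `CustomUTF_8` just has to be a `Charset` whose `newDecoder()` hands out the copied JDK 7 decoder. A minimal compilable skeleton looks like this (the copied decoder body is elided; the delegation to the platform UTF-8 below is only a placeholder so the skeleton compiles):

import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CustomUTF_8 extends Charset {

    public CustomUTF_8() {
        // A non-registered name, so it cannot collide with the real "UTF-8".
        super("X-CUSTOM-UTF-8", null);
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof CustomUTF_8;
    }

    @Override
    public CharsetDecoder newDecoder() {
        // In the actual class this returns the Decoder copied from
        // JDK 7's sun.nio.cs.UTF_8; the delegation here is a stand-in.
        return StandardCharsets.UTF_8.newDecoder();
    }

    @Override
    public CharsetEncoder newEncoder() {
        return StandardCharsets.UTF_8.newEncoder();
    }
}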

I plan to run extensive tests cross-verifying the outputs generated on Java 7 and Java 8. This is an interim solution while I am trying to fix the actual problem of passing the output from `hmac` directly to `String` without Base64-encoding it first:

String output = new String(Base64.getEncoder().encode(bytes), Charset.forName("UTF-8"));
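
Unlike decoding the raw digest as UTF-8, this is lossless: `Base64.getDecoder().decode(output)` gets the original bytes back.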
  • If testing goes well, it would be a good idea to release it as a library and put it into Maven Central, so other people with the same problem can use it. – Tagir Valeev Jun 03 '15 at 08:18
  • @TagirValeev I think it's a bad idea to make it easy for people to do this. The `CharsetDecoder` class translates a sequence of bytes in a specific charset to a sequence of sixteen-bit Unicode characters. Patchwork at this level is dangerous, as I cannot be sure that a sequence of sixteen-bit Unicode characters is interpreted the same way across two versions of the JVM. – Jiraiya Jun 04 '15 at 14:51