Strange UTF-8 processing changes when updating to Oracle Java 8

Question

Function being tested:

public static String removeNonprintableCharacters(String input) {
    StringBuilder newString = new StringBuilder(input.length());
    for (int offset = 0; offset < input.length();) {
        int codePoint = input.codePointAt(offset);
        offset += Character.charCount(codePoint);

        // Replace invisible control characters and unused code points
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:     // \p{Cc}
            case Character.FORMAT:      // \p{Cf}
            case Character.PRIVATE_USE: // \p{Co}
            case Character.SURROGATE:   // \p{Cs}
            case Character.UNASSIGNED:  // \p{Cn}
                newString.append("\ufffd");
                break;
            default:
                newString.append(Character.toChars(codePoint));
                break;
        }
    }
    return newString.toString();
}

Test method:

@Test
public void testRemoveNonprintableCharacters() throws UnsupportedEncodingException {
    assertEquals("\ufffd", r(new byte[]{0}));
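    // JDK 7: the invalid byte 0xF9 (-7) and the following 'a' decoded to a single U+FFFD
    //assertEquals("\ufffd", r(new byte[]{-7, 'a'}));
    // JDK 8: only the invalid byte becomes U+FFFD and the 'a' survives (why?)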
    assertEquals("\ufffda", r(new byte[]{-7, 'a'}));
}

private String r(byte[] bytes) throws UnsupportedEncodingException {
    return Unicode.removeNonprintableCharacters(new String(bytes, "UTF-8"));
}

As the test method shows, the returned result differs after upgrading the JVM to Java 8. Why?
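One way to narrow this down is to look at the decoded string before any filtering is applied. The following is only a sketch: `DecodeProbe` and `describe` are hypothetical names that are not part of the code above, and the program simply prints the code points that `new String(bytes, UTF_8)` produces on whichever JVM runs it.

```java
import java.nio.charset.StandardCharsets;

public class DecodeProbe {
    // Hypothetical helper: renders each code point of s as U+XXXX, space separated.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int offset = 0; offset < s.length(); ) {
            int codePoint = s.codePointAt(offset);
            offset += Character.charCount(codePoint);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", codePoint));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same input as the failing assertion: 0xF9 (-7) followed by 'a'.
        byte[] bytes = {-7, 'a'};
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        // Print the JVM version next to the decoded code points for comparison.
        System.out.println(System.getProperty("java.version") + ": " + describe(decoded));
    }
}
```

Running this under both JDKs should show whether the difference already exists in the decoded string, independently of `removeNonprintableCharacters`.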

Does your source code actually contain `�` or is that an artifact of copy/paste? In order to take that part out of the equation, I suggest you use `\u` escape sequences in both `removeNonprintableCharacters` and your test. — Jon Skeet, Sep 10 '15 at 21:43
That doesn't really matter, as the source code didn't change at all during the upgrade and the source code encoding is fixed in pom.xml... — user1050755, Sep 10 '15 at 21:45
It matters if the way the compiler "understood" your code changed, doesn't it? (I assume this is having recompiled the code.) Given that it's always good to narrow the problem down to the smallest number of possible causes, I would definitely do this... — Jon Skeet, Sep 10 '15 at 21:46
Okay, changed it. Still the same problem: jdk8 passes, jdk7 fails. — user1050755, Sep 10 '15 at 21:56
Right. I can't look now, but will do so when I get a chance. — Jon Skeet, Sep 10 '15 at 21:56
Maybe I'm misunderstanding. Using ideone with Java 7, your [code](http://ideone.com/kHMrCp) succeeds with the jdk8 version. — Sotirios Delimanolis, Sep 10 '15 at 22:28
Okay, having reproduced it, most of the code in your question isn't relevant - it's *only* the `new String(bytes, "UTF-8")` part that changes between versions. Basically the handling of invalid byte sequences appears to have changed. Now as your production code only accepts a `String`, that doesn't affect it - so do you have code elsewhere which is converting binary data into text data, and you want that to handle invalid binary data (in terms of it not being valid UTF-8) in a particular way? — Jon Skeet, Sep 11 '15 at 06:31
It's a bug fix: http://stackoverflow.com/questions/25404373/java-8-utf-8-encoding-issue-java-bug — user1050755, Sep 11 '15 at 08:08
By the way, you don’t need complicated constructs like `newString.append(Character.toChars(codePoint));`. Just use [`newString.appendCodePoint(codePoint);`](http://docs.oracle.com/javase/8/docs/api/java/lang/StringBuilder.html#appendCodePoint-int-). It’s shorter and saves a method invocation and array creation… — Holger, Sep 11 '15 at 10:43
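As the comments note, the difference comes from how `new String(bytes, "UTF-8")` substitutes U+FFFD for invalid byte sequences. If the goal is to handle invalid input explicitly rather than depend on that substitution, one option is a `CharsetDecoder` configured to report errors. This is a minimal sketch, not something taken from the question: `StrictUtf8`, `decodeStrict` and `isValidUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Hypothetical helper: decodes UTF-8 and throws on invalid input
    // instead of silently substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Hypothetical helper: true if the bytes form well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[]{-7, 'a'}));
        System.out.println(isValidUtf8(new byte[]{'a', 'b'}));
    }
}
```

With `CodingErrorAction.REPORT` the caller decides what happens to bad input, so the result no longer depends on how a particular JDK version chooses to place replacement characters.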
