
It seems that Files.newBufferedReader() is stricter about UTF-8 than the naive alternative.

If I create a file containing only the single byte 128 (so, not a valid UTF-8 sequence), it is happily read if I construct a BufferedReader on an InputStreamReader on the result of Files.newInputStream(), but with Files.newBufferedReader() an exception is thrown.
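
For reference, a file like that can be created along these lines (the file name here is just an arbitrary scratch file):

// Write exactly one byte, 0x80, which is not a valid UTF-8 sequence on its own.
Path path = Paths.get("single-byte-128.bin");
Files.write(path, new byte[] { (byte) 128 });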

This code

try (
    InputStream in = Files.newInputStream(path);
    Reader isReader = new InputStreamReader(in, "UTF-8");
    Reader reader = new BufferedReader(isReader);
) {
    System.out.println((char) reader.read());
}

try (
    Reader reader = Files.newBufferedReader(path);
) {
    System.out.println((char) reader.read());
}

has this result:

�
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:161)
    at java.io.BufferedReader.read(BufferedReader.java:182)
    at TestUtf8.main(TestUtf8.java:28)

Is this documented? And is it possible to get the lenient behavior with Files.newBufferedReader()?

Robert Tupelo-Schneck
  • Wild stab in the dark, but have you tried specifying the charset in the newBufferedReader call? – JustinKSU Jan 19 '16 at 20:29
  • @JustinKSU He shouldn't have to. That method is [documented](http://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#newBufferedReader-java.nio.file.Path-) as using UTF-8. – VGR Jan 19 '16 at 20:41

2 Answers


The difference is in how the CharsetDecoder used to decode the UTF-8 is constructed in the two cases.

For new InputStreamReader(in, "UTF-8") the decoder is constructed using:

Charset cs = Charset.forName("UTF-8");

CharsetDecoder decoder = cs.newDecoder()
          .onMalformedInput(CodingErrorAction.REPLACE)
          .onUnmappableCharacter(CodingErrorAction.REPLACE);

This explicitly specifies that invalid sequences are simply replaced with the standard replacement character.
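
To see the effect of REPLACE on its own, here is a minimal sketch that decodes just the single invalid byte from the question (it needs java.nio.ByteBuffer, java.nio.CharBuffer and the java.nio.charset classes):

CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);

// The lone byte 0x80 is malformed, so it is replaced with U+FFFD instead of
// triggering an exception. decode() declares CharacterCodingException, but
// with REPLACE it is not actually thrown here.
CharBuffer chars = lenient.decode(ByteBuffer.wrap(new byte[] { (byte) 0x80 }));
System.out.println(chars);   // prints the replacement character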

Files.newBufferedReader(path) uses:

Charset cs = StandardCharsets.UTF_8;

CharsetDecoder decoder = cs.newDecoder();

In this case onMalformedInput and onUnmappableCharacter are not called, so you get the default action (REPORT), which is to throw the exception you are seeing.
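
With the default REPORT action, the same byte makes the decoder throw instead; a sketch mirroring the snippet above:

CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder();

// Default error actions are REPORT, so decoding the lone 0x80 throws
// MalformedInputException, the same exception Files.newBufferedReader() surfaces.
strict.decode(ByteBuffer.wrap(new byte[] { (byte) 0x80 }));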

There does not seem to be a way to change what Files.newBufferedReader does. I didn't see anything documenting this while looking through the code.

greg-449

From what I can tell, it is not documented anywhere, and it is not possible to get newBufferedReader to behave leniently.

It should be documented, though. In fact, the lack of documentation on it is a valid Java bug, in my opinion, even if the amended documentation ends up saying "invalid charset sequences result in undefined behavior."

Moreover, since there is no documentation on the subject, I don't think you can safely rely on the behavior you're observing. It's entirely possible that a future version of InputStreamReader will default to using an internal CharsetDecoder that is strict.

So, to guarantee lenient behavior, I would take your code a step further:

// Build the lenient decoder up front and hand it to InputStreamReader directly.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE);

try (
    InputStream in = Files.newInputStream(path);
    Reader isReader = new InputStreamReader(in, decoder);
    Reader reader = new BufferedReader(isReader);
) {
    System.out.println((char) reader.read());
}
VGR