
I started with an InputStreamReader, but it buffers its input, reading more from the input stream than is required (as mentioned in its Javadoc). Delving into the source code (java version "1.7.0_147-icedtea"), I got to the sun.nio.cs.StreamDecoder class, which contains the comment:

// In order to handle surrogates properly we must never try to produce
// fewer than two characters at a time.  If we're only asked to return one
// character then the other is saved here to be returned later.

So I guess the question becomes "is this true, and if so why?" From my (very basic!) understanding of the 6 charsets required by the JLS, it is always possible to determine the exact number of bytes required to read a single character, so no read-ahead would be necessary.

The background is that I had a binary file containing a bunch of data with different encodings (numbers, strings, single-byte tokens etc.). The basic format was a repeating sequence of a marker byte (indicating the type of data) followed by optional data if required for that type. The two types containing character data were null-terminated strings and strings with a preceding 2-byte length. So for null-terminated strings I thought something like this would do the trick:

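String readStringWithNull(InputStream in) throws IOException {
  StringWriter sw = new StringWriter();
  // Decode the stream as UTF-16LE; the reader may consume more bytes than it returns.
  InputStreamReader isr = new InputStreamReader(in, "UTF-16LE");
  // Stop at the NUL terminator (0) or at end of stream (-1).
  for (int i; (i = isr.read()) > 0; ) {
    sw.write(i);
  }
  return sw.toString();
}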

But the InputStreamReader read ahead from the underlying stream, so subsequent read operations on the base InputStream missed data. For my particular case I knew that all characters would be in the BMP and encoded as UTF-16LE (effectively UCS-2LE), so I just coded around that, but I'm still interested in the general case above.
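For illustration, the kind of workaround described above might look like the following. This is only a sketch, not the code actually used: the class and method names (`Utf16LeStrings`, `readNullTerminated`) are hypothetical, and it assumes the terminator is a single 0x0000 code unit. It pulls exactly two bytes per code unit from the stream, so nothing beyond the terminator is consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class Utf16LeStrings {

    /**
     * Reads UTF-16LE code units two bytes at a time until a 0x0000 terminator.
     * Consumes the string's bytes plus the terminator and nothing more.
     */
    static String readNullTerminated(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        while (true) {
            int lo = in.read();   // low byte comes first in little-endian order
            int hi = in.read();
            if ((lo | hi) < 0) {  // either read returned -1
                throw new EOFException("stream ended before the terminator");
            }
            char unit = (char) ((hi << 8) | lo);
            if (unit == 0) {
                return sb.toString();
            }
            sb.append(unit);      // code units are appended as-is, without validation
        }
    }

    public static void main(String[] args) throws IOException {
        // "Hi", a 0x0000 terminator, then one unrelated marker byte.
        byte[] data = {'H', 0, 'i', 0, 0, 0, 0x7F};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(readNullTerminated(in));
        System.out.println(in.read()); // the marker byte is still available
    }
}
```

Reading single bytes like this is slow on an unbuffered stream; wrapping the raw stream in one BufferedInputStream and using that same object for every read avoids the cost without losing data.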

Also, I've seen the question "InputStreamReader buffering issue", which is similar but does not appear to answer this specific question.

Cheers,

Barney
  • Thanks, but it looks like `DataInputStream` returns a big-endian character, whereas my specific data is little-endian. – Barney Apr 19 '12 at 01:23

1 Answer

> So I guess the question becomes "is this true, and if so why?"

Yes, the comment is correct, though its phrasing is a bit obscure.

A UTF-8 encoding of a single Unicode code-point consists of between 1 and 4 bytes; see the Wikipedia UTF-8 examples. But in some cases (code-points above U+FFFF), the Unicode code-point cannot be represented as one Java char and needs a surrogate pair instead. So the decoder potentially has to decode the multi-byte UTF-8 sequence as TWO Java char values ... and hold one of them back.

> From my (very basic!) understanding of the 6 charsets required by the JLS, it is always possible to determine the exact number of bytes required to read a single character, so no read-ahead would be necessary.

It is a bit more complicated than this for variable-length encodings. The decoder reads ahead just enough bytes to form one Unicode code-point. This will be between 1 and 4 bytes for UTF-8, and by examining the bytes it knows when to stop. Then it decodes the bytes as 1 or 2 UTF-16 code-units (i.e. Java char values), delivers the first one, and saves the second one.
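The one-code-point, two-char case can be seen with a small experiment. The sketch below (the class name `SurrogateDemo` is made up for this example) feeds the four UTF-8 bytes of U+1F600, a code-point outside the BMP, through an InputStreamReader and reads the result one char at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class SurrogateDemo {

    /** Decodes UTF-8 bytes one char at a time and returns the chars read. */
    static String readCharByChar(byte[] utf8) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            for (int c; (c = r.read()) != -1; ) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+1F600 lies outside the BMP: four bytes in UTF-8, two chars in Java.
        byte[] utf8 = new String(Character.toChars(0x1F600))
                .getBytes(StandardCharsets.UTF_8);
        String decoded = readCharByChar(utf8);
        System.out.println(utf8.length + " bytes -> " + decoded.length() + " chars");
        System.out.println(Character.isHighSurrogate(decoded.charAt(0)));
        System.out.println(Character.isLowSurrogate(decoded.charAt(1)));
    }
}
```

The first `read()` call cannot return half of the byte sequence, so the decoder has to produce both surrogates at once and keep the second one for the next call.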

So you are potentially reading ahead in terms of bytes, but not in terms of code-points. And that is fine because the user's keyboard (for example) is generating code-points.


> Also, it should be possible to create an unbuffered reader which performs exactly as the standard one, but only pulls a single code-point at a time from the underlying stream, and so could be used in my example above.

Yes, it should be possible to do this. However, such a reader would need to make up to 4 separate system calls in order to read a single code-point, and that is very inefficient.
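As a rough sketch of what such an unbuffered reader would have to do for UTF-8, the hypothetical helper below (`UnbufferedUtf8.readCodePoint` is a name invented here, not a JDK API) inspects the lead byte to learn the sequence length and then issues one `read()` per continuation byte. It is deliberately simplified: it does not reject overlong encodings or encoded surrogates, and a malformed continuation byte has already been consumed by the time the error is raised.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class UnbufferedUtf8 {

    /**
     * Reads exactly one UTF-8 encoded code-point, one byte at a time.
     * Returns -1 at end of stream. Never reads past the code-point.
     */
    static int readCodePoint(InputStream in) throws IOException {
        int b0 = in.read();
        if (b0 < 0) {
            return -1;
        }
        if (b0 < 0x80) {
            return b0; // single-byte (ASCII) sequence
        }
        int extra; // number of continuation bytes announced by the lead byte
        int cp;    // payload bits accumulated so far
        if ((b0 & 0xE0) == 0xC0) {
            extra = 1;
            cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            extra = 2;
            cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            extra = 3;
            cp = b0 & 0x07;
        } else {
            throw new IOException("malformed UTF-8 lead byte: " + b0);
        }
        for (int i = 0; i < extra; i++) {
            int b = in.read(); // one read call per continuation byte
            if (b < 0) {
                throw new EOFException("stream ended inside a UTF-8 sequence");
            }
            if ((b & 0xC0) != 0x80) {
                throw new IOException("malformed UTF-8 continuation byte: " + b);
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    public static void main(String[] args) throws IOException {
        // U+20AC (three bytes in UTF-8) followed by an unrelated marker byte.
        byte[] data = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC, 0x7F};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(Integer.toHexString(readCodePoint(in)));
        System.out.println(in.read()); // the marker byte has not been consumed
    }
}
```

On a raw file or socket stream each of those `read()` calls can turn into a system call, which is the inefficiency mentioned above.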

> In fact, wouldn't this appear to be a preferred implementation, as I can always buffer the stream myself if required.

No, it is not the preferred implementation. Yes, you could (in theory) buffer the stream yourself below the decoder. However, most programs aren't written to build the stack like this:

BufferedReader > InputStreamReader > BufferedInputStream > raw InputStream

instead they just do this:

BufferedReader > InputStreamReader > raw InputStream

which would make your approach perform really slowly. (And you try explaining to the average Joe programmer why he should put an extra explicit buffering layer into the stack.)
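In code, the two stacks would be built roughly as follows. This is a sketch with invented names (`ReaderStacks`, `withByteBuffer`, `usual`), and UTF-8 is chosen arbitrarily for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ReaderStacks {

    /** The stack with an explicit byte buffer below the decoder. */
    static BufferedReader withByteBuffer(InputStream raw) {
        return new BufferedReader(
                new InputStreamReader(new BufferedInputStream(raw), StandardCharsets.UTF_8));
    }

    /** The stack most programs actually build. */
    static BufferedReader usual(InputStream raw) {
        return new BufferedReader(new InputStreamReader(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello\nworld\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(withByteBuffer(new ByteArrayInputStream(data)).readLine());
        System.out.println(usual(new ByteArrayInputStream(data)).readLine());
    }
}
```

With the standard InputStreamReader both variants perform acceptably, because the decoder already reads in chunks; with a strictly unbuffered decoder only the first would.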

> The standard InputStreamReader from OpenJDK7 appears to immediately read and buffer up to 8k from the base stream.

If they didn't do something like this, performance would be terrible ... see above. Besides, this is documented behavior - the javadoc says:

"Each invocation of one of an InputStreamReader's read() methods may cause one or more bytes to be read from the underlying byte-input stream. To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation."

The bottom line is that your use-case (where you want absolutely no low-level read-ahead on a Reader stack) is highly unusual, and not supported by the Java SE standard class library. If you really need this, feel free to implement your own version of InputStreamReader that doesn't read ahead. But it strikes me as a bit odd that you would really need this.
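The read-ahead described in the javadoc passage quoted above can be observed with a small experiment. The sketch below (the class name `ReadAheadDemo` is invented) reads a single char through an InputStreamReader and then asks the underlying ByteArrayInputStream how many bytes it still holds. The exact number is implementation-dependent, since the javadoc only says more bytes "may" be read, so no particular output is claimed here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class ReadAheadDemo {

    /**
     * Reads one char through an InputStreamReader and returns how many bytes
     * are still left in the underlying stream afterwards.
     */
    static int bytesLeftAfterOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        InputStreamReader isr = new InputStreamReader(in, StandardCharsets.UTF_16LE);
        isr.read();            // ask for a single char only
        return in.available(); // bytes the reader has NOT pulled from the stream
    }

    public static void main(String[] args) throws IOException {
        // Four chars, eight bytes; one char needs only two of them.
        byte[] data = "abcd".getBytes(StandardCharsets.UTF_16LE);
        System.out.println("bytes in stream:      " + data.length);
        System.out.println("left after one read(): " + bytesLeftAfterOneChar(data));
    }
}
```

If the second number is smaller than six, the reader has taken more than the two bytes it needed, and any later read on the base stream will miss that data.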

Stephen C
  • Thanks for that - makes it much clearer... so it is always possible to determine the exact number of bytes required to read a single **code-point** from a stream, but this may need more than one Java character to represent it, so the ISR has to buffer that character? – Barney Apr 19 '12 at 01:11
  • Also, it should be possible to create an unbuffered reader which performs exactly as the standard one, but only pulls a single code-point at a time from the underlying stream, and so could be used in my example above. In fact, wouldn't this appear to be a preferred implementation, as I can always buffer the stream myself if required. The standard InputStreamReader from OpenJDK7 appears to immediately read and buffer up to 8k from the base stream. – Barney Apr 19 '12 at 01:17
  • Fair enough. Thanks again for the clearer explanation of how it all works. – Barney Apr 19 '12 at 02:35