Java stream misconceptions... some clarification?

Question

I understand that byte streams deal with bytes and character streams deal with characters... if I use a byte stream to read in characters, could this limit me to the sorts of characters I might read? For instance, bytes are read in as 8 bit bytes, characters are read in as 16 bit characters... does this mean that more characters can be represented using character streams rather than byte streams?

The last thing I'm confused about is how a byte stream writes out to a file for reading. If I was receiving bytes from a network socket, I would wrap them in an InputStreamReader for writing; this way I would get the character transformation logic the character stream provides. If I read from a file using a FileInputStream and write out using a FileOutputStream, why is this file readable when I open it with a text editor? How is the FileOutputStream treating the bytes?

"byte streams deal with bytes and character streams deal with characters". The official terms here are InputStream/OutputStream for byte data, and Reader/Writer for characters. — Thilo, Aug 11 '11 at 12:12

score 3 · Accepted Answer · answered Aug 11 '11 at 12:14

The key concept here is character encoding: each human-readable character is encoded into one or more bytes. There are plenty of character encodings. The most popular ones are:

ASCII (7 bits; the remaining bit is unused), which stores one character per byte
UTF-8: the most common characters are represented as a single byte, less common ones as two or more

Text in these encodings remains readable even when you open a file in a hex editor. However, there are many character encodings that do not have this feature, for example UTF-16 and UTF-32.
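To make the "one or more bytes" point concrete, here is a minimal sketch that counts how many bytes a single character occupies under a few encodings. The class name `EncodingSizes` and the helper `encodedLength` are made up for this illustration; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class EncodingSizes {
    /** Number of bytes the given text occupies once encoded with the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // 'A', e with acute accent (U+00E9), euro sign (U+20AC)
        String[] samples = {"A", "\u00e9", "\u20ac"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d byte(s), UTF-16BE=%d byte(s)%n",
                    (int) s.charAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

UTF-16BE is used instead of plain UTF-16 so that the count is not inflated by the byte order mark that Java writes when encoding with `StandardCharsets.UTF_16`.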

Now back to your question: InputStream only gives you a stream of bytes. If your bytes represent characters encoded with ASCII or UTF-8, most of the time you are fine. But if these bytes represent something more sophisticated like UTF-16, you absolutely need a Reader. Of course, the reader has to know which character encoding the underlying InputStream provides. This is a common beginner mistake: a Reader that is not explicitly initialized with a character encoding will often fall back to the system default.
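As a rough sketch of the point about telling the Reader which encoding to use, the following decodes the same UTF-16 bytes twice, once with the right charset and once with a wrong one. `ExplicitCharsetReader` and `readAll` are hypothetical names chosen for this example, and an in-memory stream stands in for a file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class ExplicitCharsetReader {
    /** Decodes every byte of the stream into text using the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // The charset is passed explicitly instead of relying on the platform default.
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf16 = "h\u00e9llo".getBytes(StandardCharsets.UTF_16BE);
        // Matching charset: the original five characters come back.
        System.out.println(readAll(new ByteArrayInputStream(utf16), StandardCharsets.UTF_16BE));
        // Mismatched charset: every byte becomes its own character, so the text is garbled.
        System.out.println(readAll(new ByteArrayInputStream(utf16), StandardCharsets.ISO_8859_1).length());
    }
}
```

The bytes are identical in both calls; only the charset handed to `InputStreamReader` decides whether the result is the original text.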

The other direction (with writers) is similar. If you simply cast your chars to bytes, most of the time you will be fine. But if your characters include less common national letters, your output will be malformed or truncated. So you create a Writer that converts each given character to a series of one or more bytes. Once again, you have to provide the character encoding.
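A small sketch of the difference between casting and encoding, assuming the Polish letter U+0142 as the "national letter". The names `CastVersusWriter`, `castChars` and `encodeWithWriter` are invented for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class CastVersusWriter {
    /** Naive approach: keeps only the low 8 bits of each char and drops the rest. */
    static byte[] castChars(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            out[i] = (byte) text.charAt(i);
        }
        return out;
    }

    /** Encodes the text through a Writer that was given an explicit charset. */
    static byte[] encodeWithWriter(String text, Charset charset) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } // closing the writer flushes any buffered bytes into the sink
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "z\u0142oty"; // second letter is U+0142
        // The cast turns U+0142 into the byte 0x42, which is the letter 'B'.
        System.out.println(new String(castChars(text), StandardCharsets.US_ASCII));
        // The writer produces two bytes for U+0142, so the text survives a round trip.
        byte[] encoded = encodeWithWriter(text, StandardCharsets.UTF_8);
        System.out.println(new String(encoded, StandardCharsets.UTF_8).equals(text));
    }
}
```

For pure ASCII text both methods give the same bytes, which is why the cast appears to work most of the time.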

Important rules:

always use InputStream when dealing with binary data (multimedia, ZIP and PDF files, etc.)
always use Reader when reading text (txt, HTML, XML...)
always know and specify the character encoding when reading characters from a byte stream, and always consciously choose the character encoding you use to write the data.

great explanation; so, apart from being able to read in text that has been encoded using some irregular character set such as UTF-16, aren't character streams supposed to assist in promoting character set independence across platforms? How is this happening if I define what character encoding to use? — wulfgarpro, Aug 11 '11 at 12:46
First of all, UTF-16 is not *irregular*. It is used by Windows and Java internally :-). Readers/Writers help by abstracting away the underlying character encoding. In Java you always deal with characters (`char`s) and you don't really care which character encoding is used inside the Reader/Writer. For instance, when a library provides you with a `Reader` instance to read from, you don't care what encoding this library uses; you get encoding-independent characters. — Tomasz Nurkiewicz, Aug 11 '11 at 12:54

score 2 · Answer 2 · answered Aug 11 '11 at 12:13

A char is a 16-bit string that represents a UTF-16 code unit; most commonly used Unicode characters fit in a single char.

A byte is an 8-bit string that represents a 2's complement number.

The important thing here is that they are both bit strings. Technically speaking, a char is simply 2 bytes. Nothing more, nothing less, aside from some minor semantics in how Java treats the two. As far as the computer (or an Input/OutputStream) is concerned, the only difference is the number of bits they hold.
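A minimal sketch of "a char is simply 2 bytes", splitting one char into two bytes and putting it back together. The class `CharBytes` and both method names are made up for this example, and the high-byte-first order is an assumption; another program could equally store the low byte first, which is one reason a reader of those bytes needs to be told the encoding.

```java
// Hypothetical demo class, not part of any library.
class CharBytes {
    /** Splits a char into two bytes, high byte first (an assumed byte order). */
    static byte[] toBytes(char c) {
        return new byte[] {(byte) (c >>> 8), (byte) c};
    }

    /** Rebuilds a char from two bytes in the order produced by toBytes. */
    static char fromBytes(byte high, byte low) {
        // Mask with 0xFF because Java bytes are signed.
        return (char) (((high & 0xFF) << 8) | (low & 0xFF));
    }

    public static void main(String[] args) {
        char euro = '\u20ac';
        byte[] parts = toBytes(euro);
        System.out.printf("%02X %02X%n", parts[0] & 0xFF, parts[1] & 0xFF);
        System.out.println(fromBytes(parts[0], parts[1]) == euro);
    }
}
```

Note that this fixed two-byte layout only matches what a big-endian UTF-16 encoder would write; encodings such as UTF-8 use a different number of bytes for the same character.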

tskuzzy

sure, but that doesn't really answer my question. Obviously an InputStream can only read characters represented using one byte. – wulfgarpro Aug 11 '11 at 12:31
Why's that? You can just make it read 2 bytes and convert that to a char. The reason why it reads a byte and not anything else is because it doesn't need to. A byte is the smallest unit that Java has direct support for and any other data type can be reconstructed from its constituent bytes. – tskuzzy Aug 11 '11 at 12:34
OK cool; but, out of the box, a character stream will read in bytes, be it 1 or 2, to represent a certain encoded character? – wulfgarpro Aug 11 '11 at 12:42
It will read 2 bytes and then convert it into the corresponding Unicode character using the specified character set. – tskuzzy Aug 11 '11 at 12:44

score 1 · Answer 3 · edited May 23 '17 at 09:59

I think you need to grasp the relation between a byte and a character in order to get your clarification.

The accepted answer to this question is quite clear IMHO: Why does a byte in Java I/O can represent a character?

I'd also check out byte stream and character stream

And if you don't want Joel to catch you and make you peel onions for 6 months in a submarine, just read http://www.joelonsoftware.com/articles/Unicode.html

answered Aug 11 '11 at 12:09

Sébastien Nussbaumer

Reading Joel's explanation helped me a lot; everyone should read it. – wulfgarpro Aug 15 '11 at 12:06

score 0 · Answer 4 · answered Aug 11 '11 at 12:16

All I/O streams in Java are just byte streams underneath. Byte-to-character (and vice versa) conversions are done using an encoding. But underneath it all, they are all bytes.
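One consequence, sketched below under the assumption that in-memory streams behave like FileInputStream and FileOutputStream for this purpose: a plain byte copy never decodes anything, so text that was readable before the copy is just as readable after it. `ByteCopy` and `copy` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class ByteCopy {
    /** Copies every byte from in to out unchanged; no decoding takes place. */
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a UTF-8 text file.
        byte[] original = "caf\u00e9 \u20ac5".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(original), target);
        // The copy is byte-for-byte identical, so it decodes to the same text.
        System.out.println(Arrays.equals(original, target.toByteArray()));
    }
}
```

The byte streams never need to know the encoding here; the text editor that later opens the copy does the decoding.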

Hyangelo

score 0 · Answer 5 · answered Aug 11 '11 at 12:25

To answer your questions:

I understand that byte streams deal with bytes and character streams deal with characters... if I use a byte stream to read in characters, could this limit me to the sorts of characters I might read?

Characters are not bytes. A character is stored in one or more bytes according to the selected encoding scheme. The encoding scheme determines the range of characters you can read.

For instance, bytes are read in as 8 bit bytes, characters are read in as 16 bit characters... does this mean that more characters can be represented using character streams rather than byte streams?

In a way, yes.

The last thing I'm confused about is how a byte stream writes out to a file for reading. If I was receiving bytes from a network socket, I would wrap them in an InputStreamReader for writing; this way I would get the character transformation logic the character stream provides. If I read from a file using a FileInputStream and write out using a FileOutputStream, why is this file readable when I open it with a text editor? How is the FileOutputStream treating the bytes?

For bytes/data corresponding to characters, you should use an OutputStreamWriter to write to a file so that it is readable in a text editor. You can specify the encoding at creation, and the stream will encode your text data.
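As a hedged sketch of that advice, the following wraps a FileOutputStream in an OutputStreamWriter with the encoding given at creation. The class `TextFileWriter`, its two methods, and the temp file name are assumptions made for the example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of any library.
class TextFileWriter {
    /** Writes text to the file, encoding it with the charset chosen at creation. */
    static void writeText(Path file, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file.toFile()), charset)) {
            writer.write(text);
        }
    }

    /** Reads the file back, decoding its raw bytes with the given charset. */
    static String readText(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".txt");
        try {
            writeText(file, "na\u00efve", StandardCharsets.UTF_8);
            // The file size depends on the charset that was chosen for writing.
            System.out.println(Files.size(file) + " bytes on disk");
            System.out.println(readText(file, StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

A text editor shows the file correctly as long as it decodes with the same encoding that was passed to the writer.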
