
Why can a byte in Java I/O represent a character?

And I see that the characters are only ASCII. So it's not dynamic, right?

Is there any explanation for this?

What is the difference between byte streams and character streams?

Keenan Gebze
  • You may find this useful: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html) – Adam Paynter Sep 17 '10 at 16:27

4 Answers

33

Bytes are not characters. Alone, they can't even represent characters.

In computing, a "character" is a pairing of a numeric code (or sequence of codes) with an encoding or character set that defines how those codes map to real-world characters (or to whitespace, or to control codes).

Only once paired with an encoding can bytes represent characters. With some encodings (like ASCII or ISO-8859-1), one byte can represent one character...and many encodings are even ASCII-compatible (meaning that the character codes from 0 to 127 align with ASCII's definition for them)...but without the original mapping, you don't know what you have.

Without an encoding, bytes are just 8-bit integers.

You can interpret them any way you like by forcing an encoding onto them. That is exactly what you're doing when you convert bytes to chars, say via `new String(myBytes)`, or even when you edit a file containing the bytes in a text editor. (In that case, it's the editor applying the encoding.) In doing so, you might even get something that makes sense. But without knowing the original encoding, you can't know for sure what those bytes were intended to represent.

It might not even be text.

For example, consider the byte sequence 0x48 0x65 0x6c 0x6c 0x6f 0x2e. It can be interpreted as:

  • Hello. in ASCII and compatible 8-bit encodings;
  • dinner in some 8-bit encoding I made up just to prove this point;
  • 䡥汬漮 in big-endian UTF-16*;
  • a steel-blue pixel followed by a greyish-yellowish one, in RGB;
  • load r101, [0x6c6c6f2e] in some unknown processor's assembly language;

or any of a million other things. Those six bytes alone can't tell you which interpretation is correct.

With text, at least, that's what encodings are for.

But if you want the interpretation to be right, you need to use the same encoding to decode those bytes as was used to generate them. That's why it's so important to know how your text was encoded.
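
As a quick illustration (a minimal sketch using only the standard library; the bytes are the ones from the list above), decoding the same six bytes with two different charsets produces two of the interpretations listed:

import java.nio.charset.StandardCharsets;

byte[] bytes = {0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x2e};

// Decoded with an ASCII-compatible encoding: "Hello."
System.out.println(new String(bytes, StandardCharsets.US_ASCII));

// Decoded as big-endian UTF-16: "䡥汬漮"
System.out.println(new String(bytes, StandardCharsets.UTF_16BE));

Same bytes, two completely different strings. Nothing in the byte array itself says which one is "right".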


The difference between a byte stream and a character stream is that the character stream attempts to work with characters rather than bytes. (It actually works with UTF-16 code units. But since we know the encoding, that's good enough for most purposes.) If it's wrapped around a byte stream, the character stream uses an encoding to convert the bytes read from the underlying byte stream to chars (or chars written to the stream to bytes).
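
For example, here's a minimal sketch of wrapping a byte stream in a character stream with an explicit encoding (the file name is made up for illustration; assume this is inside a method that declares throws IOException):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// The InputStreamReader decodes the raw bytes to chars using the charset you give it.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
    System.out.println(reader.readLine());
}

Note that if you pass the wrong charset here, you typically get mojibake rather than an error; the reader happily decodes the bytes the way you told it to.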

* Note: I don't know whether "䡥汬漮" is profanity or even makes any sense...but neither does a computer unless you program it to read Chinese.

cHao
  • Why does it have to be a byte? Why not an int? – Keenan Gebze Sep 18 '10 at 05:57
  • 1
    Because underneath it all, almost all streams read and write data in byte-sized chunks. InputStreamReaders and OutputStreamWriters and DataOutputStreams and DataInputStreams and such as that can provide the illusion of writing data in bigger chunks than that, but even they write bytes behind the scenes in most cases (unless you have something like a StringWriter, that writes to a StringBuilder or something rather than an underlying stream). Those other streams just provide a way to convert the bytes into more usable data, like chars or ints. – cHao Sep 18 '10 at 07:30
  • Ooh, so the character streams are only 'chunking' the byte streams? And are the readInt, readLong, read... functions also 'chunking'? – Keenan Gebze Sep 18 '10 at 13:16
  • 1
    Pretty much. They read a certain number of bytes and build them into an int, long, etc, much like a char stream would do for a char. The difference is that Java specifies how bytes are turned into ints and such, because there's only one or two ways anyway and Java chose one ("big-endian", for future reference) rather than having IntEncodings or something and complicating the process. They couldn't do that with chars, cause there are dozens of different (and common!) ways to translate bytes to chars, depending on the language and a few other things. So you have to specify an encoding there. – cHao Sep 18 '10 at 17:40
  • byte types are not characters, but char types aren't either! They're all numbers (bits)! :D The char type can hold an unsigned 16-bit value and can thus represent all Unicode characters. This is all about abstraction. – L. Holanda Nov 15 '12 at 19:50
  • @Leo: If we want to get technical about it, a `char` can't represent all Unicode chars either. :) What we typically think of as "Unicode" actually isn't -- it's a variant of UTF-16, and would require two `char`s to represent any character outside the BMP (which includes a whole bunch of Asian chars, among other stuff). – cHao Nov 15 '12 at 21:36
  • What i'm talking about is that conceptually, you can consider bytes and characters as being in two wholly different worlds. If you're messing with bytes, you're reading/writing binary, and if you're messing with chars, it's text. Period. If you want text on a byte stream, what you really want is an InputStreamReader with the proper encoding. – cHao Nov 15 '12 at 21:51
  • @cHao: I wouldn't think of char and byte as two different worlds. It's all about numbers and abstraction. They differ in the number of bits and in signedness. But you're right: chars cannot represent all available Unicode, only the Basic Multilingual Plane, 0000-FFFF. As a matter of best practice, you're definitely right: byte = binary, char = text. – L. Holanda Dec 05 '12 at 00:11
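
As a rough sketch of the big-endian assembly mentioned in the comments above, this is the bit-shifting that DataInputStream.readInt() effectively performs on four bytes (the byte values here are made up for illustration):

// Four bytes, most significant first (big-endian), as readInt() consumes them.
byte[] b = {0x00, 0x00, 0x01, 0x02};
int value = ((b[0] & 0xFF) << 24)
          | ((b[1] & 0xFF) << 16)
          | ((b[2] & 0xFF) << 8)
          |  (b[3] & 0xFF);
System.out.println(value); // 258

The & 0xFF masking matters because Java bytes are signed; without it, a byte like 0x80 would sign-extend and corrupt the higher bits.
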
9

Bytes can represent some chars for the same reason an int can represent some longs: the smaller type can hold a subset of the larger type's values.

Char is 16-bit. Byte is 8-bit. Furthermore, char is unsigned, while byte is signed.

Try doing this:

char c = 'a';
System.out.println(c);
byte b = (byte)c;
c = (char)b;
System.out.println(c);

This will output:

a
a

Now try replacing 'a' with an en dash (Unicode U+2013), like this:

char c = '–';
System.out.println(c);
byte b = (byte)c;
c = (char)b;
System.out.println(c);

This will output the en dash on the first line, but not on the second: the narrowing cast to byte discards the high-order byte of U+2013, leaving 0x13 (a control character), so the original character is lost.

L. Holanda
3

In C and C++, a char holds a single byte, and the char type is used to mean both an 8-bit integer and a single character of text. Java is not like that.

In Java, a char and a byte are different data types. A char holds a single Unicode character which is (generally) larger than a byte. A byte holds an 8-bit integer. When you convert a char (or char[] or a String) to a byte array (type byte[]), the string is encoded according to some character encoding (usually UTF-8), and the result is how that particular string would be stored in memory (or on disk) if it was written according to that character encoding.
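
A minimal sketch of that round trip (the string literal is arbitrary):

import java.nio.charset.StandardCharsets;

String s = "héllo";
byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // 6 bytes: 'é' needs two in UTF-8
byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // 5 bytes: 'é' needs one in Latin-1
String back = new String(utf8, StandardCharsets.UTF_8);  // "héllo" again, because the same charset was used both ways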

Java I/O supports reading byte arrays (byte[]) directly from disk or writing them to it, because this is how one generally works with binary files (i.e., non-text files, where line breaks shouldn't be converted and strings shouldn't be re-encoded). The bytes in such a file may correspond to characters in an 8-bit encoding (like ASCII or ISO8859-*), but if you're going to use them that way, you should do an explicit conversion to a char[] or a String.
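
For instance, a sketch of that explicit conversion, reading the raw bytes yourself and then decoding them (the path and charset are assumptions for illustration; readAllBytes throws IOException):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

byte[] raw = Files.readAllBytes(Paths.get("legacy.txt"));   // raw bytes, no decoding applied
String text = new String(raw, StandardCharsets.ISO_8859_1); // decode explicitly, once, with the right charset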

Ken Bloom
0

The reason it is a byte is due to historical American computing. Memory, speed, and storage were all extremely expensive (and big) back when basic computing concepts were invented. Designs were very simplified and focused on the North American English-speaking world (and to some extent, still are).

Multi-byte types, like int, were only added after the foreign (to the USA) market opened up and computers had more RAM and storage space. The world uses complex writing systems, such as Chinese, that require more than one byte per character. You are likely from a part of the world that requires multi-byte characters. When I was learning programming in North America, ASCII char bytes were all I ever needed to consider. The Java designers were mostly from North America too.

As an example, the Chinese logographic writing system is huge by my North American abcdefghijklmnopqrstuvwxyz standards.

unixman83