Why can a byte in Java I/O represent a character?
And I see the characters are only ASCII. So it's not dynamic, right?
Is there an explanation for this?
What is the difference between byte streams and character streams?
Bytes are not characters. Alone, they can't even represent characters.
In computing, a "character" is a pairing of a numeric code (or sequence of codes) with an encoding or character set that defines how the codes map to real-world characters (or to whitespace, or to control codes).
Only once paired with an encoding can bytes represent characters. With some encodings (like ASCII or ISO-8859-1), one byte can represent one character...and many encodings are even ASCII-compatible (meaning that the character codes from 0 to 127 align with ASCII's definition for them)...but without the original mapping, you don't know what you have.
Without an encoding, bytes are just 8-bit integers.
You can interpret them any way you like by forcing an encoding onto them. That is exactly what you're doing when you convert a byte to a char, call new String(myBytes), and so on, or even edit a file containing the bytes in a text editor. (In that case, it's the editor applying the encoding.) In doing so, you might even get something that makes sense. But without knowing the original encoding, you can't know for sure what those bytes were intended to represent.
It might not even be text.
For example, consider the byte sequence 0x48 0x65 0x6c 0x6c 0x6f 0x2e. It can be interpreted as:

- Hello. in ASCII and compatible 8-bit encodings;
- dinner in some 8-bit encoding I made up just to prove this point;
- 䡥汬漮 in big-endian UTF-16*;
- load r101, [0x6c6c6f2e] in some unknown processor's assembly language;
- or any of a million other things.

Those six bytes alone can't tell you which interpretation is correct.
With text, at least, that's what encodings are for.
But if you want the interpretation to be right, you need to use the same encoding to decode those bytes as was used to generate them. That's why it's so important to know how your text was encoded.
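To make that concrete, here's a minimal sketch (class name is my own) that decodes the exact byte sequence from above two ways; the only thing that changes is the charset handed to the String constructor:

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        byte[] bytes = {0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x2e};

        // Same bytes, two encodings, two completely different strings.
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // Hello.
        System.out.println(new String(bytes, StandardCharsets.UTF_16BE)); // 䡥汬漮
    }
}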
The difference between a byte stream and a character stream is that the character stream attempts to work with characters rather than bytes. (It actually works with UTF-16 code units. But since we know the encoding, that's good enough for most purposes.) If it's wrapped around a byte stream, the character stream uses an encoding to convert the bytes read from the underlying byte stream to chars (or chars written to the stream to bytes).
* Note: I don't know whether "䡥汬漮" is profanity or even makes any sense...but neither does a computer unless you program it to read Chinese.
Bytes can represent some chars for the same reason an int can represent some long values. A char is 16-bit; a byte is 8-bit. Furthermore, char is unsigned and byte is signed.
Try doing this:
char c = 'a';          // 'a' is U+0061, which fits in a single byte
System.out.println(c);
byte b = (byte)c;      // the narrowing cast keeps only the low-order byte: 0x61
c = (char)b;           // widening back restores U+0061 exactly
System.out.println(c);
This will output:
a
a
Now try replacing 'a' with an en dash (Unicode U+2013), like this:
char c = '–';          // '–' is U+2013, which does not fit in a byte
System.out.println(c);
byte b = (byte)c;      // the narrowing cast discards the high-order byte, leaving 0x13
c = (char)b;           // widening back gives U+0013, not U+2013
System.out.println(c);
This will output:
–
followed by an unprintable control character (U+0013): the cast to byte kept only the low-order byte of U+2013, so the original character was lost.
In C and C++, a char holds a single byte, and the type char is used both as an 8-bit integer and as a single character of text. Java is not like that.
In Java, a char and a byte are different data types. A char holds a single Unicode character, which is (generally) larger than a byte. A byte holds an 8-bit integer. When you convert a char (or char[] or a String) to a byte array (type byte[]), the string is encoded according to some character encoding (usually UTF-8), and the result is how that particular string would be stored in memory (or on disk) if it was written according to that character encoding.
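For instance (a small sketch, with names of my own choosing), encoding one String under two different charsets produces different byte arrays:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    public static void main(String[] args) {
        String s = "é"; // U+00E9, a single char

        // The same string encodes to different bytes under different encodings.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -87]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-23]
    }
}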
Java IO supports reading and writing byte arrays (byte[]) directly to and from disk because this is how one generally works with binary files (i.e. non-text files, where linebreaks shouldn't be converted and strings shouldn't be re-encoded). The bytes in that file may correspond to characters in an 8-bit encoding (like ASCII or ISO8859-*), but if you're going to use them that way, you should do an explicit conversion to a char[] or a String.
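A sketch of that explicit conversion (the file name data.txt is hypothetical, and I'm assuming the file is ISO-8859-1 text):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadTextDemo {
    public static void main(String[] args) throws IOException {
        // Read the raw bytes first; no encoding is involved yet.
        byte[] raw = Files.readAllBytes(Path.of("data.txt")); // hypothetical file

        // Only now choose the encoding used to interpret the bytes as text.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(text);
    }
}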
The reason it is a byte is due to historical American computing. Memory, speed, and storage were all extremely expensive (and physically large) back when basic computing concepts were invented. Designs were very simplified and focused on the North American English-speaking world (and to some extent, still are).
Multi-byte types, like int, were only added after the foreign (to the USA) market opened up and computers had more RAM and storage space. The world uses complex writing systems, such as Chinese, that require more than one byte per character. You are likely from a part of the world that requires multi-byte characters. When I was learning programming in North America, ASCII char bytes were all I ever needed to consider. The Java designers were mostly from North America too.
As an example, the Chinese logographic writing system is huge by my North American abcdefghijklmnopqrstuvwxyz standards.