I have been reading about how Unicode code points have evolved over time, including this article by Joel Spolsky, which says:
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
But despite all this reading, I couldn't find the real reason why Java uses UTF-16 for a char.
Isn't UTF-8 far more efficient than UTF-16? For example, if I had a string that contains 1024 characters, all within the ASCII range, UTF-16 would take 1024 * 2 bytes (2 KB) of memory.
But if Java used UTF-8, it would be just 1 KB of data. Even if the string has a few characters that need more than one byte, it would still only take about a kilobyte. For example, suppose that in addition to the 1024 ASCII characters, there were 10 characters of "字" (code point U+5B57, UTF-8 encoding e5 ad 97). In UTF-8, this would still take only (1024 * 1 byte) + (10 * 3 bytes) = 1 KB + 30 bytes.
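To double-check my arithmetic, here is a minimal sketch of how the encoded sizes can be compared (assuming Java 11+ for String.repeat, and using the standard java.nio.charset.StandardCharsets constants; UTF_16LE is used so no BOM is added). It only measures encoded byte lengths, which is the comparison I'm making above:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    public static void main(String[] args) {
        // 1024 ASCII characters, then the same string plus 10 copies of "字" (U+5B57)
        String ascii = "a".repeat(1024);        // String.repeat requires Java 11+
        String mixed = ascii + "字".repeat(10);

        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 1024
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length); // 2048
        System.out.println(mixed.getBytes(StandardCharsets.UTF_8).length);    // 1024 + 10*3 = 1054
        System.out.println(mixed.getBytes(StandardCharsets.UTF_16LE).length); // 2048 + 10*2 = 2068
    }
}
```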
So memory efficiency doesn't answer my question: 1 KB + 30 bytes for UTF-8 is clearly less memory than 2 KB for UTF-16.
Of course it makes sense that Java doesn't use ASCII for a char, but why doesn't it use UTF-8, which has a clean mechanism for handling arbitrary multi-byte characters when they come up? UTF-16 looks like a waste of memory in any string that is mostly made up of characters that would fit in a single byte.
Is there some good reason for UTF-16 that I'm missing?