
I have been reading about how Unicode code points have evolved over time, including this article by Joel Spolsky, which says:

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

But despite all this reading, I couldn't find the real reason that Java uses UTF-16 for a char.

Isn't UTF-8 far more efficient than UTF-16? For example, if I had a string which contains 1024 ASCII-range characters, UTF-16 will take 1024 * 2 bytes (2KB) of memory.

But if Java used UTF-8, it would be just 1KB of data. Even if the string has a few characters which need more than one byte each, it will still only take about a kilobyte. For example, suppose in addition to the 1024 characters there were 10 characters of "字" (code point U+5B57, UTF-8 encoding e5 ad 97). In UTF-8, this will still take only (1024 * 1 byte) + (10 * 3 bytes) = 1KB + 30 bytes.

So this doesn't answer my question. 1KB + 30 bytes for UTF-8 is clearly less memory than 2KB for UTF-16.
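
For what it's worth, this arithmetic is easy to check from Java itself. Here is a minimal sketch (the class name is mine, `String.repeat` needs Java 11+, and UTF-16LE is used only to avoid counting a byte-order mark):

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        // 1024 ASCII letters plus ten copies of U+5B57 (字)
        String s = "a".repeat(1024) + "字".repeat(10);

        int utf8Bytes  = s.getBytes(StandardCharsets.UTF_8).length;
        int utf16Bytes = s.getBytes(StandardCharsets.UTF_16LE).length; // LE: no BOM is prepended

        System.out.println("UTF-8 : " + utf8Bytes  + " bytes"); // 1024*1 + 10*3 = 1054
        System.out.println("UTF-16: " + utf16Bytes + " bytes"); // (1024 + 10)*2 = 2068
    }
}
```

This counts encoded bytes, but the in-memory `char[]` behind a String costs the same two bytes per `char`.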

Of course it makes sense that Java doesn't use ASCII for a char, but why does it not use UTF-8, which has a clean mechanism for handling arbitrary multi-byte characters when they come up? UTF-16 looks like a waste of memory in any string which has lots of non-multibyte chars.

Is there some good reason for UTF-16 that I'm missing?

– FZE
  • Suppose you want to access the 576th char of the string, and it's represented as a UTF-8 encoded byte array. What is the cost of the operation? – JB Nizet Mar 26 '16 at 14:32 (a sketch of this cost follows the comments below)
  • Hmm, sure, I missed the point about indexing. It would have to scan all of the preceding bytes to work out which character it is. So they decided to sacrifice memory to save CPU. – FZE Mar 26 '16 at 14:56
  • Strings are immutable - it is possible (and it still would be possible to retrofit this without breaking existing *Java* code [it would probably break JNI]) to store strings with only codes 0-255 in an 8-bit encoding, and strings with other codes in 16-bit like it is now. But it seems that the need for this isn't very high (at least I haven't seen a big demand for this). – Erwin Bolwidt Mar 26 '16 at 15:05
  • @ErwinBolwidt it's actually [scheduled for Java 9](http://openjdk.java.net/jeps/254) – Clashsoft Mar 26 '16 at 17:20
  • Good question. Also, good answers [here](http://stackoverflow.com/questions/3240498/why-does-the-java-ecosystem-use-different-character-encodings-throughout-their-s) and [here](http://programmers.stackexchange.com/questions/174947/why-does-java-use-utf-16-for-internal-string-representation). – aioobe May 19 '16 at 05:57
  • It's safER to just go to the (576*2)th byte in a UTF-16 string to find the 576th character. But UTF-16 still allows for 32-bit characters (surrogate pairs of two 16-bit code units). AFAIK, Java (and C# as well, for that matter) just ignores this when accessing the Nth character in a string, meaning you could either end up at a different character than you expected to, or end up with half a character. – Cedric Mamo May 19 '16 at 20:33
  • In UTF-8, the character 字 needs not just 2, but the 3 bytes `e5 ad 97`. – Evgeniy Berezovsky Aug 23 '16 at 03:28
  • @JBNizet your rhetorical question is misleading: UTF-8 and UTF-16 have the same performance in that case, unless the JVM keeps track of whether the string has only BMP code points and optimizes for that case. – Cody Piersall Jul 17 '17 at 18:03
  • The .Net framework does this too... no clue why. Weird early design decisions I guess. – Nyerguds Oct 12 '18 at 08:39
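
To make the indexing cost from JB Nizet's comment concrete, here is a rough sketch (the `codePointAt` helper is hypothetical, not a library method, and it skips validation of malformed input). Finding the Nth code point in a UTF-8 byte array means walking every preceding lead byte, whereas a fixed-width two-byte representation answers the same question with a single array access:

```java
import java.nio.charset.StandardCharsets;

public class Utf8IndexCost {

    // Hypothetical helper: return the code point at position `index` in UTF-8 bytes.
    // Every lookup is O(n), because each code point occupies 1-4 bytes and there is
    // no way to know where the Nth one starts without scanning all the earlier ones.
    static int codePointAt(byte[] utf8, int index) {
        int offset = 0;
        for (int i = 0; i < index; i++) {
            int lead = utf8[offset] & 0xFF;
            if      (lead < 0x80) offset += 1; // 1-byte sequence (ASCII)
            else if (lead < 0xE0) offset += 2; // 2-byte sequence
            else if (lead < 0xF0) offset += 3; // 3-byte sequence
            else                  offset += 4; // 4-byte sequence
        }
        int remaining = Math.min(4, utf8.length - offset);
        return new String(utf8, offset, remaining, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] utf8 = "abc字def".getBytes(StandardCharsets.UTF_8);
        int cp = codePointAt(utf8, 3);             // has to skip over 'a', 'b', 'c' first
        System.out.println(Character.toChars(cp)); // 字
    }
}
```

This is the trade-off FZE summarizes above: UTF-8 saves memory, while a fixed-width representation saves CPU on indexed access.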

2 Answers


Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. This then left UTF-16 as the easiest natural progression beyond that.

– nj_

Historically, one reason was the performance characteristics of random access to, and iteration over, the characters of a String:

UTF-8 encoding uses a variable number (1-4) of bytes to encode a Unicode character. Therefore accessing a character by index, String.charAt(i), would be far more complicated to implement and slower than the array access used by java.lang.String.

Even today, Python uses a fixed-width format for Strings internally, storing either 1, 2, or 4 bytes per character depending on the maximum size of a character in that string.

Of course, this is no longer a pure benefit in Java, since, as nj_ explains, Java no longer uses a fixed-width format. But at the time the language was developed, Unicode was a fixed-width 16-bit format (now called UCS-2), and this would have been an advantage.
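
To make that concrete, here is a small sketch (the class name is mine; U+1F600 is just an example supplementary character). String.length() and String.charAt() count UTF-16 code units, so the fixed-width view only holds while every character stays inside the BMP:

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        String bmp = "字";             // U+5B57, inside the BMP: one char
        String emoji = "\uD83D\uDE00"; // U+1F600 (😀), outside the BMP: a surrogate pair

        System.out.println(bmp.length());                               // 1
        System.out.println(emoji.length());                             // 2 code units...
        System.out.println(emoji.codePointCount(0, emoji.length()));    // ...but only 1 code point
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true: charAt(0) is half a character
    }
}
```

This is what the comments below argue about: indexed access is O(1) in code units, but not in code points.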

– wero
  • This was true for UCS-2, but UCS-2 ceased to exist when Unicode expanded beyond the BMP (i.e. beyond the first 65536 characters); nowadays there's only UTF-16, and it is a variable-length encoding exactly like UTF-8. You can bury your head in the sand and think you are iterating over Unicode code points only until you hit the first surrogate pair. See @nj_'s answer for the details. – Matteo Italia May 19 '16 at 20:26
  • @MatteoItalia The question asks why Java does not use e.g. UTF-8 to store Strings in order to save memory compared to the current implementation. My answer gave a particular reason - namely performance of accessing characters by index - why UTF-8 might not be a good idea. – wero May 19 '16 at 21:03
  • the point is that UTF-16, too, is a variable-length encoding. – Matteo Italia May 19 '16 at 21:53
  • @MatteoItalia so you want Oracle to remove `String.charAt` because it lets people bury their heads in the sand? – wero May 19 '16 at 22:00
  • No, I want to point out that it's false that UTF-16 has any advantage over UTF-8 when it comes to seeking to a given code point, because *UTF-16, exactly like UTF-8, is a variable-length encoding*, which takes either 1 or 2 code units to encode a single code point. **If you want O(1) seek to a given code point you want UTF-32, not UTF-16**. For this reason, your answer is plain wrong - or rather, 21 years out of date (IIRC it was in 1995 that Unicode was expanded beyond the BMP, killing the fixed-length UCS-2 encoding, which became the UTF-16 variable-length encoding). – Matteo Italia May 19 '16 at 23:04