
I have been reading about how Unicode code points have evolved over time, including this article by Joel Spolsky, which says:

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

But despite all this reading, I couldn't find the real reason that Java uses UTF-16 for a char.

Isn't UTF-8 far more efficient than UTF-16? For example, if I had a string which contains 1024 ASCII-range characters, UTF-16 will take 1024 * 2 bytes (2KB) of memory.

But if Java used UTF-8, it would be just 1KB of data. Even if the string has a few characters which need more than one byte each, it will still only take about a kilobyte. For example, suppose in addition to the 1024 characters there were 10 characters of "字" (code point U+5B57, UTF-8 encoding e5 ad 97). In UTF-8, this will still take only (1024 * 1 byte) + (10 * 3 bytes) = 1KB + 30 bytes.

So this doesn't answer my question. 1KB + 30 bytes for UTF-8 is clearly less memory than 2KB for UTF-16.
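
For what it's worth, this arithmetic is easy to check from Java itself. Here is a minimal sketch (the class name is mine, `String.repeat` needs Java 11+, and UTF-16LE is used only to avoid counting a byte-order mark):

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        // 1024 ASCII letters plus ten copies of U+5B57 (字)
        String s = "a".repeat(1024) + "字".repeat(10);

        int utf8Bytes  = s.getBytes(StandardCharsets.UTF_8).length;
        int utf16Bytes = s.getBytes(StandardCharsets.UTF_16LE).length; // LE: no BOM is prepended

        System.out.println("UTF-8 : " + utf8Bytes  + " bytes"); // 1024*1 + 10*3 = 1054
        System.out.println("UTF-16: " + utf16Bytes + " bytes"); // (1024 + 10)*2 = 2068
    }
}
```

This counts encoded bytes, but the in-memory `char[]` behind a String costs the same two bytes per `char`.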

Of course it makes sense that Java doesn't use ASCII for a char, but why does it not use UTF-8, which has a clean mechanism for handling arbitrary multi-byte characters when they come up? UTF-16 looks like a waste of memory in any string which has lots of non-multibyte chars.

Is there some good reason for UTF-16 that I'm missing?

– FZE
  • Suppose you want to access the 576th char of the string, and it's represented as a UTF-8 encoded byte array. What is the cost of the operation? – JB Nizet Mar 26 '16 at 14:32 (a sketch of this cost follows the comments below)
  • Hmm, sure, I missed the point about indexing. It would have to scan all of the preceding bytes to work out which character it is. So they decided to sacrifice memory to save CPU. – FZE Mar 26 '16 at 14:56
  • Strings are immutable - it is possible (and it still would be possible to retrofit this without breaking existing *Java* code [it would probably break JNI]) to store strings with only codes 0-255 in an 8-bit encoding, and strings with other codes in 16-bit like it is now. But it seems that the need for this isn't very high (at least I haven't seen a big demand for this). – Erwin Bolwidt Mar 26 '16 at 15:05
  • @ErwinBolwidt it's actually [scheduled for Java 9](http://openjdk.java.net/jeps/254) – Clashsoft Mar 26 '16 at 17:20
  • Good question. Also, good answers [here](http://stackoverflow.com/questions/3240498/why-does-the-java-ecosystem-use-different-character-encodings-throughout-their-s) and [here](http://programmers.stackexchange.com/questions/174947/why-does-java-use-utf-16-for-internal-string-representation). – aioobe May 19 '16 at 05:57
  • It's safER to just go to the (576*2)th byte in a UTF-16 string to find the 576th character. But UTF-16 still allows for 32-bit characters (surrogate pairs of two 16-bit code units). AFAIK, Java (and C# as well, for that matter) just ignores this when accessing the Nth character in a string, meaning you could either end up at a different character than you expected to, or end up with half a character. – Cedric Mamo May 19 '16 at 20:33
  • In UTF-8, the character 字 needs not just 2, but the 3 bytes `e5 ad 97`. – Evgeniy Berezovsky Aug 23 '16 at 03:28
  • @JBNizet your rhetorical question is misleading: UTF-8 and UTF-16 have the same performance in that case, unless the JVM keeps track of whether the string has only BMP code points and optimizes for that case. – Cody Piersall Jul 17 '17 at 18:03
  • The .Net framework does this too... no clue why. Weird early design decisions I guess. – Nyerguds Oct 12 '18 at 08:39
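
To make the indexing cost from JB Nizet's comment concrete, here is a rough sketch (the `codePointAt` helper is hypothetical, not a library method, and it skips validation of malformed input). Finding the Nth code point in a UTF-8 byte array means walking every preceding lead byte, whereas a fixed-width two-byte representation answers the same question with a single array access:

```java
import java.nio.charset.StandardCharsets;

public class Utf8IndexCost {

    // Hypothetical helper: return the code point at position `index` in UTF-8 bytes.
    // Every lookup is O(n), because each code point occupies 1-4 bytes and there is
    // no way to know where the Nth one starts without scanning all the earlier ones.
    static int codePointAt(byte[] utf8, int index) {
        int offset = 0;
        for (int i = 0; i < index; i++) {
            int lead = utf8[offset] & 0xFF;
            if      (lead < 0x80) offset += 1; // 1-byte sequence (ASCII)
            else if (lead < 0xE0) offset += 2; // 2-byte sequence
            else if (lead < 0xF0) offset += 3; // 3-byte sequence
            else                  offset += 4; // 4-byte sequence
        }
        int remaining = Math.min(4, utf8.length - offset);
        return new String(utf8, offset, remaining, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] utf8 = "abc字def".getBytes(StandardCharsets.UTF_8);
        int cp = codePointAt(utf8, 3);             // has to skip over 'a', 'b', 'c' first
        System.out.println(Character.toChars(cp)); // 字
    }
}
```

This is the trade-off FZE summarizes above: UTF-8 saves memory, while a fixed-width representation saves CPU on indexed access.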

2 Answers


Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical:

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character.

This, and the birth of UTF-16, is further explained by the Unicode FAQ page:

Originally, Unicode was designed as a pure 16-bit encoding, aimed at representing all modern scripts. (Ancient scripts were to be represented with private-use characters.) Over time, and especially after the addition of over 14,500 composite characters for compatibility with legacy sets, it became clear that 16 bits were not sufficient for the user community. Out of this arose UTF-16.

As @wero has already mentioned, random access cannot be done efficiently with UTF-8. So all things weighed up, UCS-2 was seemingly the best choice at the time, particularly as no supplementary characters had been allocated by that stage. This then left UTF-16 as the easiest natural progression beyond that.

– nj_

Historically, one reason was the performance characteristics of random access to, and iteration over, the characters of a String:

UTF-8 encoding uses a variable number (1-4) of bytes to encode a Unicode character. Therefore accessing a character by index, String.charAt(i), would be far more complicated to implement and slower than the array access used by java.lang.String.

Even today, Python uses a fixed-width format for Strings internally, storing either 1, 2, or 4 bytes per character depending on the maximum size of a character in that string.

Of course, this is no longer a pure benefit in Java, since, as nj_ explains, Java no longer uses a fixed-width format. But at the time the language was developed, Unicode was a fixed-width 16-bit format (now called UCS-2), and this would have been an advantage.
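
To make that concrete, here is a small sketch (the class name is mine; U+1F600 is just an example supplementary character). String.length() and String.charAt() count UTF-16 code units, so the fixed-width view only holds while every character stays inside the BMP:

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        String bmp = "字";             // U+5B57, inside the BMP: one char
        String emoji = "\uD83D\uDE00"; // U+1F600 (😀), outside the BMP: a surrogate pair

        System.out.println(bmp.length());                               // 1
        System.out.println(emoji.length());                             // 2 code units...
        System.out.println(emoji.codePointCount(0, emoji.length()));    // ...but only 1 code point
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true: charAt(0) is half a character
    }
}
```

This is what the comments below argue about: indexed access is O(1) in code units, but not in code points.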

– wero
  • This was true for UCS-2, but UCS-2 ceased to exist when Unicode expanded beyond the BMP (i.e. beyond the first 65536 characters); nowadays there's only UTF-16, and it is a variable-length encoding exactly like UTF-8. You can bury your head in the sand and think you are iterating over Unicode code points only until you hit the first surrogate pair. See @nj_'s answer for the details. – Matteo Italia May 19 '16 at 20:26
  • @MatteoItalia The question asks why Java does not use e.g. UTF-8 to store Strings in order to save memory compared to the current implementation. My answer gave a particular reason - namely performance of accessing characters by index - why UTF-8 might not be a good idea. – wero May 19 '16 at 21:03
  • the point is that UTF-16, too, is a variable-length encoding. – Matteo Italia May 19 '16 at 21:53
  • @MatteoItalia so you want Oracle to remove `String.charAt` because it lets people bury their heads in the sand? – wero May 19 '16 at 22:00
  • No, I want to point out that it's false that UTF-16 has any advantage over UTF-8 when it comes to seeking to a given code point, because *UTF-16, exactly like UTF-8, is a variable-length encoding*, which takes either 1 or 2 code units to encode a single code point. **If you want O(1) seek to a given code point you want UTF-32, not UTF-16**. For this reason, your answer is plain wrong - or rather, 21 years out of date (IIRC it was in 1995 that Unicode was expanded beyond the BMP, killing the fixed-length UCS-2 encoding, which became the UTF-16 variable-length encoding). – Matteo Italia May 19 '16 at 23:04