Java was designed and first implemented back in the days when Unicode was specified as a set of 16-bit code points. That is why `char` is a 16-bit type, and why `String` is modeled as a sequence of `char` values.

Now, if the Java designers had been able to foresee that Unicode would add extra "code planes", they might1 have opted for a 32-bit `char` type.
Java 1.0 came out in January 1996. Unicode 2.0 (which introduced the higher code planes and the surrogate mechanism) was released in July 1996.
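To make the consequence concrete, here is a small sketch showing how a code point outside the Basic Multilingual Plane ends up as a surrogate pair in a `String` (U+1F600 is just an arbitrary example of a supplementary character):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1F600 (GRINNING FACE) lies above U+FFFF, so it cannot fit in a single 16-bit char.
        String s = new StringBuilder().appendCodePoint(0x1F600).toString();

        // The String API counts 16-bit char units, not code points ...
        System.out.println(s.length());                       // 2  (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));  // 1

        // ... and charAt(0) returns only the high surrogate, not a whole character.
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
    }
}
```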
Internally, I believe that some versions of Java have used UTF-8 as the representation for strings, at least at some level. However, the implementation still has to honor the methods specified in the `String` API, because that is what Java applications depend on. Doing that is going to be inefficient if the primary internal representation is UTF-8 rather than UTF-16.
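As an illustration of the cost, here is a rough sketch (the method name and decoding approach are made up for illustration, not taken from any real JVM) of what `charAt(int)` would have to do if the backing storage were UTF-8 bytes rather than an array of 16-bit units:

```java
// Hypothetical sketch: charAt over a UTF-8 backing array has to scan from the start,
// because UTF-8 encodes code points in 1-4 bytes and supplementary code points expand
// to two UTF-16 chars. With a char[] backing array it is a constant-time index.
static char charAtUtf8(byte[] utf8, int index) {
    int bytePos = 0;   // position in the UTF-8 byte array
    int charPos = 0;   // position in UTF-16 char units
    while (bytePos < utf8.length) {
        int lead = utf8[bytePos] & 0xFF;
        int byteLen = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        int cp = new String(utf8, bytePos, byteLen, java.nio.charset.StandardCharsets.UTF_8)
                .codePointAt(0);
        int charLen = Character.charCount(cp);   // 1 or 2 UTF-16 units
        if (index < charPos + charLen) {
            return Character.toChars(cp)[index - charPos];
        }
        bytePos += byteLen;
        charPos += charLen;
    }
    throw new StringIndexOutOfBoundsException(index);
}
```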
And before you suggest that they should "just change the String APIs" ... consider how many trillions of lines of Java code already exist that depend on the current String APIs.
For what it is worth, many other programming languages that support Unicode do it via a 16-bit `char` or `wchar_t` type; JavaScript and C# strings, for example, are also sequences of UTF-16 code units.
1 - ... and possibly not, bearing in mind that memory was a lot more expensive back then, and programmers worried much more about such things.