
Java uses UTF-16 for its internal text representation. But why? UTF-8 seems more flexible to me.

From Wikipedia:

> UTF-8 requires either 8, 16, 24 or 32 bits (one to four octets) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
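You can check those sizes from Java itself. A quick sketch (it assumes a JDK that also ships the optional UTF-32BE charset, which OpenJDK and the Oracle JDK do; only UTF-8 and UTF-16 are guaranteed by the spec):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        // "A" (U+0041), "é" (U+00E9), "€" (U+20AC) and U+1D11E (outside the BMP)
        String[] samples = { "A", "é", "€", new String(Character.toChars(0x1D11E)) };
        for (String s : samples) {
            System.out.printf("%-2s UTF-8: %d bytes, UTF-16: %d bytes, UTF-32: %d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length,    // BE variant avoids the BOM
                    s.getBytes(Charset.forName("UTF-32BE")).length); // UTF-32 is optional but common
        }
    }
}
```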

Pavel_K
Answered at [Why does Java use UTF-16 for internal string representation?](http://programmers.stackexchange.com/questions/174947/why-does-java-use-utf-16-for-internal-string-representation). – Joe Oct 18 '15 at 06:21

1 Answer


Java was designed and first implemented back in the days when Unicode was specified as a set of 16-bit code points. That is why char is a 16-bit type, and why String is modeled as a sequence of char.

Now, if the Java designers had been able to foresee that Unicode would add extra "code planes", they might1 have opted for a 32-bit char type.

Java 1.0 came out in January 1996. Unicode 2.0 (which introduced the higher code planes and the surrogate mechanism) was released in July 1996.
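The practical consequence is that code points from those later planes are stored as surrogate pairs, i.e. two char values per code point. A small illustration (U+1D11E is just an arbitrary supplementary character):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) is above U+FFFF, so UTF-16 stores it
        // as two 16-bit code units: a high surrogate followed by a low surrogate.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                             // 2 -- chars (UTF-16 code units)
        System.out.println(clef.codePointCount(0, clef.length()));     // 1 -- actual Unicode code points
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}
```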


Internally, I believe that some versions of Java have used UTF-8 as the representation for strings, at least at some level. However, it is still necessary to map this to the methods specified in the String API, because that is what Java applications require. Doing that when the primary internal representation is UTF-8 rather than UTF-16 would be inefficient.
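To see why, consider charAt(int): with UTF-16 code units it is a direct array access, whereas with UTF-8 bytes you have to scan from the start, because each code point occupies a variable number of bytes. A deliberately simplified sketch of that scan (not how any real JVM does it, just the shape of the cost):

```java
import java.nio.charset.StandardCharsets;

public class Utf8IndexingSketch {
    // Hypothetical helper: byte offset of the n-th code point in UTF-8 text.
    // Every lookup walks all the bytes before it -- O(n) -- because each code
    // point may occupy 1 to 4 bytes. With UTF-16 code units, charAt(n) is O(1).
    static int byteOffsetOfCodePoint(byte[] utf8, int n) {
        int offset = 0;
        for (int seen = 0; seen < n; seen++) {
            int lead = utf8[offset] & 0xFF;
            if (lead < 0x80)      offset += 1; // ASCII, one byte
            else if (lead < 0xE0) offset += 2; // two-byte sequence
            else if (lead < 0xF0) offset += 3; // three-byte sequence
            else                  offset += 4; // four-byte sequence
        }
        return offset;
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(byteOffsetOfCodePoint(utf8, 4)); // 5: 'o' sits after the 2-byte 'é'
    }
}
```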

And before you suggest that they should "just change the String APIs" ... consider how many trillions of lines of Java code already exist that depend on the current String APIs.


For what it is worth, most if not all programming languages that support Unicode do it via a 16-bit char or wchar type.


1 - ... and possibly not, bearing in mind that memory was a lot more expensive back then, and programmers worried much more about such things.

Stephen C