Yes, UTF-16 was invented when Unicode expanded past the 65,536 code point limit of Unicode 1.0 to the 1,114,112 code point limit it has today.
This allows it to support the entire Universal Character Set while maintaining compatibility with UCS-2, the encoding of all Unicode characters as two-byte units that is obsolete precisely because it cannot encode all the characters of Unicode 2.0 or later.
> Does the first char offer some method to determine that a second char is used, or that the 2 belong together?
Yes, in UTF-16, a two-byte unit is either:
- A high surrogate, which must always be followed by a low surrogate. It lies between `0xD800` and `0xDBFF` inclusive, and `isHighSurrogate` will return `true` for it.
- A low surrogate, which must always follow a high surrogate. It lies between `0xDC00` and `0xDFFF` inclusive, and `isLowSurrogate` will return `true` for it.
- A non-surrogate, which maps directly to the BMP character of the same code point.
Surrogates combine to represent astral plane characters:
- Subtract 0x010000 from the code point.
- Add the top 10 bits to 0xD800 to get the high surrogate.
- Add the lower 10 bits to 0xDC00 to get the low surrogate.
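The three steps above can be sketched directly in Java; the bit arithmetic below is my own illustration, checked against the library's `highSurrogate`/`lowSurrogate`:

```java
// Manual surrogate computation for a supplementary code point,
// mirroring the three steps above. U+10300 is used as the example.
public class SurrogateMath {
    public static void main(String[] args) {
        int codePoint = 0x10300;
        int offset = codePoint - 0x10000;               // 1. subtract 0x010000
        char high = (char) (0xD800 + (offset >> 10));   // 2. add top 10 bits to 0xD800
        char low  = (char) (0xDC00 + (offset & 0x3FF)); // 3. add low 10 bits to 0xDC00
        System.out.printf("high=U+%04X low=U+%04X%n", (int) high, (int) low);
        // Matches the library's own computation:
        System.out.println(high == Character.highSurrogate(codePoint)
                        && low  == Character.lowSurrogate(codePoint));
    }
}
```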
In Java you can do this by first checking `isBmpCodePoint` on an `int` holding the code point. If that is true then you can just cast it to `char` to get the single UTF-16 unit that encodes it. Otherwise you can call `highSurrogate` to get the first `char` and `lowSurrogate` to get the second.
As well as `isBmpCodePoint` you could use `charCount`, which returns `1` for BMP characters and `2` if you need surrogates. This is useful if you are going to create an array of either 1 or 2 characters to hold the value.
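Putting those two paragraphs together, a sketch of encoding any code point into a correctly sized `char` array (note that the standard library's `Character.toChars` already does exactly this):

```java
// Size the array with charCount, then fill it with either a plain
// cast (BMP) or a surrogate pair (supplementary).
public class EncodeCodePoint {
    static char[] encode(int codePoint) {
        char[] units = new char[Character.charCount(codePoint)];
        if (Character.isBmpCodePoint(codePoint)) {
            units[0] = (char) codePoint;                   // one UTF-16 unit
        } else {
            units[0] = Character.highSurrogate(codePoint); // first unit
            units[1] = Character.lowSurrogate(codePoint);  // second unit
        }
        return units;
    }

    public static void main(String[] args) {
        System.out.println(encode('A').length);     // 1
        System.out.println(encode(0x10300).length); // 2
    }
}
```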
Since the surrogate code points are never assigned characters, the encoding is unambiguous for the entire Universal Character Set.
It's also self-correcting: a mistake in the stream can be isolated rather than causing all further characters to be misread. E.g. if we find an isolated low surrogate we know that unit is wrong, but we can still read the rest of the stream.
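A small sketch of that self-correction property (the string with a deliberately unpaired surrogate is my own example): `codePointAt` returns the lone surrogate as-is, and decoding resumes cleanly on the very next unit.

```java
// An unpaired low surrogate is isolated by codePointAt rather than
// corrupting the rest of the stream.
public class IsolatedSurrogate {
    public static void main(String[] args) {
        String broken = "\uDF00AB";  // lone low surrogate, then "AB"
        System.out.println(broken.codePointAt(0) == 0xDF00); // the bad unit, isolated
        System.out.println(broken.codePointAt(1) == 'A');    // reading resumes cleanly
    }
}
```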
Some full examples, but I'm not too hot in Java (Unicode, on the other hand, I know well, and that's the knowledge I used to answer this), so if someone spots a n00b Java error but thinks I got the Unicode-knowledge part correct, please just go ahead and edit this post accordingly:
`"𐌀"` is a string with a single Unicode character, U+10300, which is a letter from the Old Italic alphabet. For the most part, these "Astral Plane" characters, as they're semi-jokingly called, are relatively obscure, as the Unicode Consortium try to be as useful as they can without going outside the easier-to-use BMP (Basic Multilingual Plane; U+0000 to U+FFFF, though sometimes listed as U+0000 to U+FFFD, as U+FFFE and U+FFFF are both non-characters and shouldn't be used in most cases).
(If you're experimenting with this, then the examples that use 𐌀 directly will depend on how well your text editor copes with it.)
If you examine `"𐌀".length()` you'll get `2`, because `length()` gives you the number of UTF-16 encoding units, not the number of characters.
`new StringBuilder().appendCodePoint(0x10300).toString().equals("𐌀")` should return `true` (note `.equals`, not `==`, since `toString` creates a new `String` object).
`Character.charCount(0x10300)` will return `2`, as we need two UTF-16 `char`s to encode it. `Character.isBmpCodePoint(0x10300)` will return `false`.
`Character.codePointAt("𐌀", 0)` will return `66304`, which is `0x10300`, because when it sees a high surrogate it also reads the following low surrogate in the calculation.
`Character.highSurrogate(0x10300) == 0xD800 && Character.lowSurrogate(0x10300) == 0xDF00` is `true`, as those are the high and low surrogates the character should be split into to encode it in UTF-16.
Likewise `"𐌀".charAt(0) == 0xD800 && "𐌀".charAt(1) == 0xDF00`, because `charAt` deals with UTF-16 units, not Unicode characters.
By the same token `"𐌀".equals("\uD800\uDF00")` is `true`, where the second literal uses escapes for the two surrogates.
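The claims above can be checked in one small runnable program; this sketch uses the escape form of the literal so it survives any source-file encoding:

```java
// A roundup of the U+10300 examples, using \uD800\uDF00 escapes
// instead of the raw character.
public class AstralDemo {
    public static void main(String[] args) {
        String s = "\uD800\uDF00";                           // U+10300 via escapes
        System.out.println(s.length());                      // 2 UTF-16 units
        System.out.println(s.codePointAt(0) == 0x10300);     // true
        System.out.println(s.codePointCount(0, s.length())); // 1 actual character
        System.out.println(new StringBuilder().appendCodePoint(0x10300)
                               .toString().equals(s));       // true
    }
}
```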