Though String holds Unicode text as a char array, where each char is a two-byte UTF-16 code unit, there is also support for UCS4.
UCS4: UTF-32, "code points":
Unicode code points (UCS4) are represented in Java as int.
int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
ucs4 = s.codePoints().toArray();
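As a small sketch of what the snippet above implies: a code point outside the 16-bit range occupies two char values in the String, so length() and the number of code points differ. The class name CodePointLength and the helper fromCodePoints are made up for this illustration; the String methods are standard.

```java
public class CodePointLength {
    // Hypothetical helper: builds a String from UCS4 code points.
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = fromCodePoints(0x1F923);

        // Number of UTF-16 chars (code units) in the String.
        System.out.println(s.length());                            // 2
        // Number of Unicode code points in the String.
        System.out.println(s.codePointCount(0, s.length()));       // 1
        // The code point starting at char index 0.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f923
    }
}
```

So indexing by char (charAt) and indexing by code point are not the same thing once such characters appear.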
UTF-16 and UTF-8 are encodings (transformations) of code points into sequences of 2-byte or 1-byte values, respectively. A code point that does not fit into a single such value is encoded as a longer sequence.
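The encoding is chosen such that every value in a multi-value sequence differs from any value that stands alone. A byte or char inside such a sequence will therefore never erroneously match "/" or any other ASCII text in a string search. This is realized by reserved high bits: in UTF-8 every byte of a multi-byte sequence starts with a 1 bit (110..., 1110... or 11110... for the first byte, 10... for the following bytes), and in UTF-16 a pair of surrogate chars from the reserved range 0xD800 to 0xDFFF is used. The remaining bits hold the bits of the code point in big-endian order (most significant first); UTF-16 first subtracts 0x10000.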
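The bit layout described above can be checked with a short sketch. It assumes a code point above 0xFFFF; the class name Utf16Sketch and the helpers toSurrogates and allHighBitsSet are made up for this illustration, and in real code Character.toChars already does the surrogate computation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Sketch {
    // Hypothetical helper: splits a code point above 0xFFFF into a surrogate pair.
    static char[] toSurrogates(int codePoint) {
        int v = codePoint - 0x10000;                // leaves a 20-bit value
        char high = (char) (0xD800 | (v >> 10));    // upper 10 bits
        char low = (char) (0xDC00 | (v & 0x3FF));   // lower 10 bits
        return new char[] {high, low};
    }

    // Hypothetical helper: true if every byte has its most significant bit set.
    static boolean allHighBitsSet(byte[] bytes) {
        for (byte b : bytes) {
            if ((b & 0x80) == 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int cp = 0x1F923;
        char[] pair = toSurrogates(cp);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83E DD23
        System.out.println(Arrays.equals(pair, Character.toChars(cp))); // true

        byte[] utf8 = new String(pair).getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);                       // F0 9F A4 A3
        }
        System.out.println();
        // No byte of the sequence can be mistaken for '/' (0x2F) or other ASCII.
        System.out.println(allHighBitsSet(utf8));                       // true
    }
}
```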
Rather than searching for UCS4 and UCS2, search for UTF-16; that will turn up details on the algorithms used.