The JDK 19 release notes state the following:
The java.lang.Character class supports Unicode Character Database of 14.0 level, which adds 838 characters, for a total of 144,697 characters
The Java char
data type is only 16 bits wide, so it can represent at most 65,536 distinct values. I notice that the String
class can also decode byte arrays as Unicode strings, but only with a recognized charset such as the constants in the StandardCharsets
class. If we look at the implementation, they all appear to work with 16-bit UTF-16 code units internally.
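A quick diagnostic of my own seems to confirm the 16-bit storage: a single code point above U+FFFF is split into two char values (a surrogate pair) inside a String.

public class SurrogateCheck
{
    public static void main(String[] args)
    {
        // U+1E020 lies above U+FFFF, so it cannot fit in one 16-bit char
        String s = Character.toString(0x1E020);
        System.out.println(s.length());                      // prints 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // prints 1 (actual code point)
    }
}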
Yet if I use the constructor String(int[] codePoints, int offset, int count)
or Character.toString(int codePoint)
and then write the result to a file, every character beyond the 16-bit range comes out as 0x3F
(a one-byte '?' value).
import java.io.FileWriter;
import java.io.IOException;

public class Main
{
    public static void main(String[] args) throws IOException
    {
        // Four supplementary code points, all above U+FFFF
        int[] code = {0x1E020, 0x1E021, 0x1E022, 0x1E023};
        String str = new String(code, 0, 4);
        try (FileWriter writer = new FileWriter("C:\\test.txt"))
        {
            writer.write(str);
        }
    }
}
If you open test.txt in a hex editor, you'll notice that all four Unicode characters have been written as the single byte value 0x3F
. Using any of the constants in StandardCharsets
does not solve the issue either.
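For example, one of the variants I tried (encoding the bytes myself; the charset choice here is only illustrative) produces the same replacement byte whenever the charset cannot represent the code point:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeCheck
{
    public static void main(String[] args)
    {
        String str = new String(new int[] {0x1E020}, 0, 1);
        // A charset that cannot map the code point substitutes '?' (0x3F)
        byte[] ascii = str.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(ascii)); // prints [63], i.e. 0x3F
    }
}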
So how do I process 32-bit Unicode characters? Is there a standard Java class that can accept and automatically handle Unicode characters that consist of 4 bytes?
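For context, the closest I have found so far are the code-point-based methods on String; I am assuming these are the intended route, so this is only a sketch:

public class CodePointIteration
{
    public static void main(String[] args)
    {
        String str = new String(new int[] {0x1E020, 0x1E021}, 0, 2);
        // Iterate full code points (int values) instead of 16-bit chars
        str.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}

This at least reads the code points back correctly in memory, but it does not tell me how to get them into a file intact.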