
The JDK 19 release notes state the following:

The java.lang.Character class supports Unicode Character Database of 14.0 level, which adds 838 characters, for a total of 144,697 characters

Java's char data type is only 16 bits wide, so it can represent at most 65,536 distinct values. I notice that the String class can also parse byte arrays as Unicode strings only if the encoding is recognized by the StandardCharsets class, and if we look at the implementation, they all appear to work with 16-bit code units.
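
For reference, here is a minimal sketch (my own illustration, not from the release notes) of how Java encodes a supplementary code point as a UTF-16 surrogate pair, so that a single code point above 0xFFFF occupies two char values:

public class SurrogateDemo
{
    public static void main(String[] arg)
    {
        int codePoint = 0x1E020; // a supplementary code point from the example below
        char[] units = Character.toChars(codePoint); // encoded as a surrogate pair
        System.out.println(Character.charCount(codePoint)); // prints 2
        System.out.printf("high=0x%04X low=0x%04X%n", (int) units[0], (int) units[1]);
        String s = new String(units);
        // The String still reports the original code point:
        System.out.printf("codePointAt(0)=0x%X, length()=%d%n", s.codePointAt(0), s.length());
    }
}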

If I try to use the constructor String(int[] codePoints, int offset, int count) or Character.toString(int codePoint), then they just convert every character beyond the 16-bit range into 0x3F (a 1-byte value).

import java.io.FileWriter;
import java.io.IOException;

public class Test
{
    public static void main(String[] arg) throws IOException
    {
        int[] code = {0x1E020, 0x1E021, 0x1E022, 0x1E023};
        String str = new String(code, 0, 4);
        // FileWriter uses the JVM's default charset (the platform charset before JDK 18)
        try (FileWriter writer = new FileWriter("C:\\test.txt"))
        {
            writer.write(str);
        }
    }
}

If you open the test.txt file in a hex editor, you'll notice that all 4 Unicode characters have been converted into bytes with the value 0x3F. Using any of the constants in StandardCharsets does not solve the issue.

So how do I process 32-bit Unicode characters? Is there a standard Java class that can accept and automatically process Unicode characters that consist of 4 bytes?
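
One way to narrow down where the conversion to 0x3F happens is to inspect the String before it is written out; a minimal sketch (my own check, not part of the original program):

public class CodePointCheck
{
    public static void main(String[] arg)
    {
        int[] code = {0x1E020, 0x1E021, 0x1E022, 0x1E023};
        String str = new String(code, 0, 4);
        // Print the code points stored in the String; if they survive here,
        // the loss happens during encoding in the file-writing step.
        str.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
        System.out.println("char length: " + str.length()); // 8: four surrogate pairs
    }
}

If this prints U+1E020 through U+1E023, the String itself is intact and the problem lies in the encoding used when writing the file, as the comments below point out.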

  • I'm not sure what you mean by "then they will just convert every character beyond 16-bit into 0x3F (1-byte values). You can see this when writing the resulting String object into a file". When writing a string to a file, you would need to specify an encoding (implicitly or explicitly), so how can you "see" 0x3F? What encoding are you using, and what is wrong with 0x3F? What did you expect instead? Please, show a [mcve]. – Sweeper May 29 '23 at 10:40
  • Does this answer your question? [Java Unicode encoding](https://stackoverflow.com/questions/2533097/java-unicode-encoding) – aled May 29 '23 at 11:24
  • *then they will just convert every character beyond 16-bit into 0x3F (1-byte values). You can see this when writing the resulting String object into a file* Not the case. Try the following: `int[] codepoints = { 0x10149, 0x1014A }; String s = new String(codepoints, 0, codepoints.length); byte[] bytes = s.getBytes("UTF-8"); System.out.println(HexFormat.ofDelimiter(" ").formatHex(bytes));` – g00se May 29 '23 at 11:44
  • Java Strings and chars are [UTF-16](https://en.wikipedia.org/wiki/UTF-16). All char values are UTF-16 values. There are plenty of methods in the Character class for converting UTF-16 surrogate values to Unicode codepoints and vice versa. – VGR May 29 '23 at 15:15
  • @Sweeper you can see 0x3F by using Hex Editor then open the file. – Alan Garnium May 29 '23 at 15:59
  • *you can see 0x3F by using Hex Editor **then** open the file* (My emphasis) That doesn't make sense. If you mean "when you open the file in a hex editor, you see 0x3F in places" that probably just means that you encoded the file wrongly. The way to do it properly would be something like: `try (Writer out = Files.newBufferedWriter(Path.of("utf8.txt"), StandardCharsets.UTF_8)) { out.write(s); }` (where 's' is the string). If that *still* doesn't work then it would mean that you started with a 'bad' string – g00se May 29 '23 at 16:01
  • @VGR those are not able to process 32-bit Unicode characters. – Alan Garnium May 29 '23 at 16:11
  • *those are not able to process 32-bit Unicode characters* But they are: `String s = "𐅉𐅊"; System.out.println(s);` prints `𐅉𐅊`, the same 2-codepoint string as in the first code I posted, which prints `f0 90 85 89 f0 90 85 8a` – g00se May 29 '23 at 16:18
  • What do you mean? String has a constructor that takes ints and a `codePoints()` method to convert to Unicode codepoints. Character has charCount, toChars, toCodePoint, and several get*, is*, and to* methods that take a 32-bit int argument. – VGR May 29 '23 at 16:41
  • Looks like a *post hoc ergo propter hoc* – g00se May 29 '23 at 16:50
  • @g00se Please see the source code in the post. `0x1E020` is not equal to `f0 90 85 89` or `f0 90 85 8a`. Hence the standard String class is unable to accept 32-bit Unicode characters. Do you know another class which can properly process `0x1E020` as a single Unicode character? – Alan Garnium May 30 '23 at 03:53
  • @VGR being able to take a 32-bit int as input does not mean that it can process 32-bit Unicode characters properly. The byte sequence in the `int` array in the source code above should be the same as what the hex editor shows. Have you tried them yourself, or did you just read the documentation? – Alan Garnium May 30 '23 at 03:58
  • `FileWriter` uses the JVM default file encoding, which might well be something like `windows-1252`, which means characters not representable in `windows-1252` will be replaced by a question mark `?`, which in `windows-1252` (and other single-byte encodings derived from ASCII) is encoded using `0x3F`. Create `FileWriter` with an explicit character set. – Mark Rotteveel May 30 '23 at 08:43
  • @AlanGarnium, you don't seem to be reading my comments properly. I had already explained how to write to file [here](https://stackoverflow.com/questions/76355989/how-to-handle-unicode-characters-beyond-16-bits-in-jdk-19?noredirect=1#comment134648323_76355989) – g00se May 30 '23 at 09:02
  • What Mark said. The default charset before Java 18 was the system charset, which on Windows is a one-byte windows-125x charset that is only capable of (almost) 256 characters. Instead of FileWriter, use `try (BufferedWriter writer = Files.newBufferedWriter(Path.of("C:\\test.txt")))`, which is guaranteed to write UTF-8. – VGR May 30 '23 at 11:17
  • @MarkRotteveel @VGR @g00se the idea is not to have it encoded into UTF-8, but to have it stored natively as the actual **32-bit Unicode code points**, so that when an external text reader application (with full support of Unicode 14.0) reads it, it will natively display the intended Unicode characters (above `0xFFFF`) – Alan Garnium May 30 '23 at 12:22
  • *But to have it stored natively with the actual 32-bit Unicode codepoints* But that's not how character encoding works, and unless you have some special text reading application it would fall over if you did that. The code you've been given will allow any proper Unicode-enabled editor to open the file correctly – g00se May 30 '23 at 12:26
  • If you really insist, you could always use the UTF-32 (or UTF-32LE or UTF-32BE) character set (e.g. `Charset.forName("UTF-32")`), but I think you'll not find editors which can actually read UTF-32. – Mark Rotteveel May 30 '23 at 15:59
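
Pulling together the suggestions from the comments above, a combined sketch (the file names are placeholders) that writes the same string once with an explicit UTF-8 charset and once as UTF-32BE, then dumps the resulting bytes:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HexFormat;

public class ExplicitCharsetDemo
{
    public static void main(String[] arg) throws IOException
    {
        int[] code = {0x1E020, 0x1E021, 0x1E022, 0x1E023};
        String str = new String(code, 0, 4);

        // UTF-8, as suggested in the comments; any Unicode-aware editor can read this.
        Path utf8 = Path.of("test-utf8.txt");
        Files.writeString(utf8, str, StandardCharsets.UTF_8);
        System.out.println("UTF-8 : " + HexFormat.ofDelimiter(" ").formatHex(Files.readAllBytes(utf8)));

        // UTF-32BE stores each code point as a raw 4-byte value.
        Path utf32 = Path.of("test-utf32.txt");
        Files.writeString(utf32, str, Charset.forName("UTF-32BE"));
        System.out.println("UTF-32: " + HexFormat.ofDelimiter(" ").formatHex(Files.readAllBytes(utf32)));
    }
}

In the UTF-32BE file the hex dump reads 00 01 e0 20 00 01 e0 21 …, i.e. the literal code points the question asked for, though, as the last comment notes, few editors will open such a file.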

0 Answers