convert ucs-4 to ucs-2

Question

The Unicode code point of the UCS-4 character '🤣' is U+1F923 (0001f923). When I paste it into Java code in IntelliJ IDEA, it is automatically changed to the corresponding escape sequence \uD83E\uDD23.

Java only supports UCS-2, so a transformation from UCS-4 to UCS-2 takes place.

I want to know the logic of this transformation, but I couldn't find any material about it.

yelliver · Accepted Answer · 2019-09-16T10:34:49.460

https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

U+010000 to U+10FFFF

0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.

The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.

The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.

Now with the input code point U+1F923:

0x1F923 - 0x10000 = 0xF923
0xF923 = 1111100100100011 = 00001111100100100011 (padded to 20 bits) = [0000111110][0100100011] = [0x3E][0x123]
0xD800 + 0x3E = 0xD83E
0xDC00 + 0x123 = 0xDD23
The result: \uD83E\uDD23

Programming:

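public class SurrogatePair {
    public static void main(String[] args) {
        int input = 0x1f923;        // code point above U+FFFF
        int x = input - 0x10000;    // 20-bit offset U'

        int highTenBits = x >> 10;              // top 10 bits
        int lowTenBits = x & ((1 << 10) - 1);   // bottom 10 bits (mask 0x3FF)

        int high = highTenBits + 0xd800;    // high surrogate W1
        int low = lowTenBits + 0xdc00;      // low surrogate W2

        // Prints [d83e][dd23]
        System.out.println(String.format("[%x][%x]", high, low));
    }
}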
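
As a cross-check, the manual arithmetic can be compared against the helpers that java.lang.Character already provides. This is only a sketch: the class name SurrogateCheck and the method manualSurrogates are made up for illustration, while Character.toChars, Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint are standard JDK methods. It assumes the input is a supplementary code point (U+10000 to U+10FFFF).

```java
import java.util.Arrays;

public class SurrogateCheck {
    // Hypothetical helper: the manual split described above.
    // Assumes codePoint is in the range 0x10000..0x10FFFF.
    static char[] manualSurrogates(int codePoint) {
        int x = codePoint - 0x10000;
        char high = (char) (0xD800 + (x >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (x & 0x3FF));   // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int input = 0x1F923;
        char[] manual = manualSurrogates(input);
        char[] builtin = Character.toChars(input);

        // true: both approaches give the same pair
        System.out.println(Arrays.equals(manual, builtin));

        // [d83e][dd23]
        System.out.printf("[%x][%x]%n",
                (int) Character.highSurrogate(input),
                (int) Character.lowSurrogate(input));

        // Reverse direction, surrogate pair back to code point: 1f923
        System.out.printf("%x%n", Character.toCodePoint(manual[0], manual[1]));
    }
}
```

In real code the built-in methods are probably the better choice, since Character.toChars also handles code points below U+10000 and rejects invalid ones.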

score 1 · Answer 2 · answered Sep 16 '19 at 11:27

Although a String holds Unicode text as a sequence of char values, where each char is a two-byte UTF-16 code unit, Java also supports UCS-4.

UCS4: UTF-32, "code points":

Unicode code points (UCS-4) are represented in Java as int.

int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
ucs4 = s.codePoints().toArray();
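
To make the difference between char units and code points visible, here is a small sketch. The class name CodePointLength and the helper fromCodePoint are invented for this example; length, codePointCount, charAt and codePointAt are standard String methods.

```java
public class CodePointLength {
    // Hypothetical helper: build a String from a single code point.
    static String fromCodePoint(int codePoint) {
        return new String(new int[] {codePoint}, 0, 1);
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F923);

        // length() counts UTF-16 code units: 2
        System.out.println(s.length());

        // codePointCount() counts Unicode code points: 1
        System.out.println(s.codePointCount(0, s.length()));

        // The two stored char values: d83e dd23
        System.out.printf("%x %x%n", (int) s.charAt(0), (int) s.charAt(1));

        // codePointAt() reassembles the pair: 1f923
        System.out.printf("%x%n", s.codePointAt(0));
    }
}
```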

There are encodings (transformations) of code points to UTF-16 and UTF-8, which need longer sequences of 2-byte or 1-byte values respectively. The encodings are designed so that the values in such a multi-unit sequence differ from any value that stands for a character on its own. That means such a value will never erroneously match "/" or any other string being searched for. In UTF-8 this is realized by setting the high bit(s) of every byte in the sequence to 1, followed by bits of the code point in big-endian order (most significant first); in UTF-16 the surrogates come from the reserved range 0xD800–0xDFFF.
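
The byte sequences can be inspected directly. The following is a sketch only: the class name EncodingBytes and the hex helper are made up for illustration, while String.getBytes and StandardCharsets are standard JDK API.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Hypothetical helper: format bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1F923));

        // UTF-8: four bytes, each with the high bit set: f0 9f a4 a3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));

        // UTF-16BE: the surrogate pair as two 2-byte units: d8 3e dd 23
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

None of the four UTF-8 bytes falls in the ASCII range (0x00 to 0x7F), which is why a byte-wise search for "/" cannot match inside the sequence.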

Rather than searching for UCS4 and UCS2, a search for UTF-16 will yield info on the algorithms used.

this answer is also very useful! – wongoo Sep 17 '19 at 09:36
