I've read several answers on here describing how to convert a single 16-bit hex value to a Unicode character (UChar in ICU). What I am not clear on is how to convert a sequence of multiple codes (two or more hex strings) to a single 32-bit Unicode character. For instance, how do I represent ...

U+1F6A3 U+200D U+2642 U+FE0F 🚣‍♂️

... as a single 32-bit Unicode character? The inputs are the literal strings "U+1F6A3", "U+200D", "U+2642" and "U+FE0F", not the 16-bit values.

phuclv
user14998757
  • 3
    Note: Unicode is complex, and it is about more than characters. Many things you think of as a single character can take many code points. Unicode doesn't set a limit (I think there was a recommendation of around 15 or 31 combining characters per base character). You can then combine many such things into a single grapheme (or grapheme cluster). There is no way to do that with a fixed number of bytes. [Luckily, it is the font and the shaping engine that should take care of this.] – Giacomo Catenazzi Jan 25 '22 at 14:14

1 Answer

There's no such thing as a "32-bit Unicode character"

Unicode is a 21-bit code space, and UTF-32 is an encoding in which every code point is encoded by exactly one 32-bit code unit. That makes UTF-32 fixed-length per code point, but not per character as a user perceives it: many characters, like the one you posted above, are sequences of several code points and can't be encoded by a single UTF-32 code unit. U+1F6A3 U+200D U+2642 U+FE0F is simply encoded as the four code units 0x0001F6A3 0x0000200D 0x00002642 0x0000FE0F, which is 16 bytes, period. You can't make it 32-bit. Also note that U+1F6A3 doesn't fit in 16 bits, because Unicode is 21-bit as mentioned previously, so it must be encoded by 2 code units (a surrogate pair) in UTF-16.
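To make the byte counts concrete, here is a minimal sketch in Python (chosen only for brevity; the same steps apply with ICU in C or C++). It assumes the input really is a list of literal "U+XXXX" strings as described in the question. The helper name `parse_code_points` is made up for this illustration and is not part of any library.

```python
def parse_code_points(tokens):
    """Turn strings such as "U+1F6A3" into a list of integer code points.

    Hypothetical helper for illustration only.
    """
    code_points = []
    for token in tokens:
        if not token.upper().startswith("U+"):
            raise ValueError(f"expected 'U+XXXX', got {token!r}")
        value = int(token[2:], 16)
        # Valid code points are 0..0x10FFFF, excluding the surrogate range.
        if value < 0 or value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"{token!r} is not a valid Unicode code point")
        code_points.append(value)
    return code_points


tokens = ["U+1F6A3", "U+200D", "U+2642", "U+FE0F"]
code_points = parse_code_points(tokens)

# The result is one string holding four code points, not one "character value".
text = "".join(chr(cp) for cp in code_points)

utf32 = text.encode("utf-32-be")  # big-endian, no byte order mark
utf16 = text.encode("utf-16-be")

print(len(code_points))  # 4
print(len(utf32))        # 16
print(len(utf16))        # 10
print(utf32.hex(" ", 4)) # 0001f6a3 0000200d 00002642 0000fe0f
```

The output of the conversion is a string of four code points. Whether it is displayed as a single glyph is decided later by the font and the text shaper, not by the encoding.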

For more information, read "Isn't a 2-byte char datatype insufficient to deal with the concept of "characters" in a Unicode string?"

phuclv
  • So all 64 bits of that is just inline in text like that? – user14998757 Jan 25 '22 at 14:20
  • 3
    @user14998757 that's a series of 128 bits in UTF-32, not 64. In UTF-16 it's `D83D DEA3 200D 2642 FE0F` which is 10 bytes = 80 bits. And many other characters can be combined from even more bytes. It's just a linear byte stream, just open the file in any hex editor, or hex dump the string and see. Please read the link above – phuclv Jan 25 '22 at 14:23
  • Got it! Thank you! – user14998757 Jan 25 '22 at 14:27
  • if an answer helps you then please click the green checkmark to [accept](https://stackoverflow.com/help/accepted-answer) it – phuclv May 06 '22 at 12:23
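
The UTF-16 figure quoted in the comments (`D83D DEA3 200D 2642 FE0F`, 10 bytes) can be checked by hand. The sketch below applies the standard surrogate-pair arithmetic to each code point; the function name `to_utf16_units` is an illustrative assumption, not an ICU or standard-library API.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    if code_point < 0x10000:
        # Code points in the Basic Multilingual Plane need one 16-bit unit.
        return [code_point]
    # Supplementary code points are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


units = []
for cp in [0x1F6A3, 0x200D, 0x2642, 0xFE0F]:
    units.extend(to_utf16_units(cp))

hex_units = " ".join(f"{u:04X}" for u in units)
print(hex_units)                         # D83D DEA3 200D 2642 FE0F
print(len(units) * 2, len(units) * 16)   # 10 80
```

Only U+1F6A3 lies above U+FFFF, so it is the only code point that expands to two code units; the other three take one unit each, giving five units in total.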