
I'm kinda confused about how Unicode code points are converted to UTF-16, and I'm looking for someone who can explain it to me in the easiest way possible.

For a character like "𐒌" (U+1048C) we get:

d801dc8c -->  UTF-16
0001048c -->  UTF-32
f090928c -->  UTF-8
66700    -->  Decimal Value

So, the UTF-16 hexadecimal value converts to "11011000 00000001 11011100 10001100", which is "3624000652" in decimal. My question is: how do we get this value, and how can we convert it back to the real code point value of "66700"?

The UTF-32 hexadecimal value converts to "00000000 00000001 00000100 10001100", which is "66700" in decimal, but the UTF-16 value doesn't convert back to "66700"; instead we get "3624000652".
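
For reference, both decimal values are just the hex strings read as plain integers (a minimal Perl sketch using only the core `hex` and `printf` builtins):

# Reading the raw hex as one big number is only base conversion,
# not decoding -- which is why the UTF-16 bytes don't give back 66700.
printf "%u\n", hex 'd801dc8c'; # 3624000652
printf "%u\n", hex '0001048c'; # 66700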

How is the conversion actually happening?

For UTF-8's 4-byte encoding, it goes like 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
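
For example, filling that template with the 21 bits of U+1048C reproduces the f0 90 92 8c bytes above (a minimal Perl sketch; the masks and shifts are the standard UTF-8 layout):

my $cp = 0x1048C;                 # code point
my @utf8 = (
    0xF0 |  ($cp >> 18),          # 11110xxx -> 0xF0
    0x80 | (($cp >> 12) & 0x3F),  # 10xxxxxx -> 0x90
    0x80 | (($cp >>  6) & 0x3F),  # 10xxxxxx -> 0x92
    0x80 | ( $cp        & 0x3F),  # 10xxxxxx -> 0x8C
);
printf "%02x %02x %02x %02x\n", @utf8; # f0 90 92 8c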

But how does this happen in UTF-16? If anyone can explain it to me in the easiest possible way, that would be a huge help, because I've been searching for the past few days and haven't been able to find a good answer that makes sense to me.

The websites I used for conversion were Branah.com and rapidtables.com.

  • UTF-16 uses [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF) for characters beyond Plane 0. – georg Oct 02 '19 at 19:21
  • `𐒌` (U+1048C) is hex `0xF0 0x90 0x92 0x8C` in UTF-8, hex `0xD801 0xDC8C` in UTF-16, hex `0x0001048C` in UTF-32. It is easier to read them if you express them in terms of *code units* instead of *raw bytes*. UTF-8 encodes a code point in 1-4 8-bit code units. UTF-16 encodes a code point in 1-2 16-bit code units. UTF-32 encodes a code point in 1 32-bit code unit. Also remember that UTF-16 and UTF-32 code units are subject to *endianness*, whereas UTF-8 code units are not. – Remy Lebeau Oct 02 '19 at 21:10 *(a short sketch of this point follows the comment thread)*
  • @georg I would like to know more about surrogate pairs; how do they work? – learningweb Oct 02 '19 at 21:15
  • Consider researching at [http://unicode.org/faq](http://unicode.org/faq). I can say that as an English speaker, "surrogate" seems strange. I would use different jargon: "[alibi](https://english.stackexchange.com/a/223979)". What I mean is that certain codepoints are reserved so they don't get confused with UTF-16 code units, of the same numeric value, that require code unit pairs. From the perspective of UTF-16 there is no reason to call those values "surrogate" (or "alibi"). From the perspective of Unicode codepoints, those values are "alibis" because they refer to something elsewhere. – Tom Blodget Oct 03 '19 at 00:24
  • Possible duplicate of [Is UTF-16 compatible with UTF-8?](https://stackoverflow.com/questions/32499846/is-utf-16-compatible-with-utf-8) – tripleee Oct 03 '19 at 08:24
  • I'm somewhat starting to understand surrogate pairs now, but one confusion I have right now is: how do you decide which pairs you are going to use for certain code points? Or do you pick any randomly and then remember to use the same ones when decoding? – learningweb Oct 04 '19 at 10:24
  • I finally figured it out, thanks everyone for your time and efforts. – learningweb Oct 04 '19 at 15:43
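
A minimal sketch of the code-unit and endianness point from the comments above, using Perl's core Encode module (assuming a stock Perl where Encode is available):

use Encode qw(encode);

my $ch = "\x{1048C}";
printf "UTF-16BE: %s\n", unpack 'H*', encode('UTF-16BE', $ch); # d801dc8c
printf "UTF-16LE: %s\n", unpack 'H*', encode('UTF-16LE', $ch); # 01d88cdc (same code units, bytes swapped)
printf "UTF-8:    %s\n", unpack 'H*', encode('UTF-8',    $ch); # f090928c (no byte order involved)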

1 Answer


> how do we get this value
>
> how can we convert it back to the real code point
>
> about surrogate pairs, how do they work?

Study the algorithm for encoding to UTF-16:

my $U = 66_700; # code point
if ($U > 0xffff) {
    my $U_prime = $U - 0x1_0000; # some intermediate value 0x0_0000 .. 0xF_FFFF
    sprintf '%d', $U_prime;      # 1164
    sprintf '0x%04X', $U_prime;  # 0x048C
    sprintf '0b%020b', $U_prime; # 0b00000000010010001100

    my $high_ten_bits = $U_prime >> 10;  # range 0x000 .. 0x3FF
    sprintf '0b%010b', $high_ten_bits;   # 0b0000000001

    my $low_ten_bits = $U_prime & 0x3FF; # range 0x000 .. 0x3FF
    sprintf '0b%010b', $low_ten_bits;    # 0b0010001100

    my $W1 = $high_ten_bits + 0xD800; # high surrogate
    sprintf '%d', $W1;      # 55297
    sprintf '0x%04X', $W1;  # 0xD801
    sprintf '0b%016b', $W1; # 0b1101100000000001

    my $W2 = $low_ten_bits + 0xDC00;  # low surrogate
    sprintf '%d', $W2;      # 56460
    sprintf '0x%04X', $W2;  # 0xDC8C
    sprintf '0b%016b', $W2; # 0b1101110010001100

    # finally emit the concatenation of W1 and W2

    # your original arithmetic checks out:
    ($W1 << 16) + $W2;  # 3624000652
}
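
The same forward step can be written more compactly (a sketch equivalent to the block above):

my $cp      = 0x1048C;
my $U_prime = $cp - 0x1_0000;
my ($W1, $W2) = (0xD800 + ($U_prime >> 10), 0xDC00 + ($U_prime & 0x3FF));
printf "0x%04X 0x%04X\n", $W1, $W2; # 0xD801 0xDC8C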

Reverse direction:

my @octets = (0xD8, 0x01, 0xDC, 0x8C);
my $W1 = ($octets[0] << 8) + $octets[1];
sprintf '%d', $W1;      # 55297
sprintf '0x%04X', $W1;  # 0xD801
sprintf '0b%016b', $W1; # 0b1101100000000001

my $W2 = ($octets[2] << 8) + $octets[3];
sprintf '%d', $W2;      # 56460
sprintf '0x%04X', $W2;  # 0xDC8C
sprintf '0b%016b', $W2; # 0b1101110010001100

my $high_ten_bits = $W1 - 0xD800;
sprintf '0b%010b', $high_ten_bits; # 0b0000000001

my $low_ten_bits = $W2 - 0xDC00;
sprintf '0b%010b', $low_ten_bits;  # 0b0010001100

my $U_prime = ($high_ten_bits << 10) + $low_ten_bits;
sprintf '%d', $U_prime;      # 1164
sprintf '0x%04X', $U_prime;  # 0x048C
sprintf '0b%020b', $U_prime; # 0b00000000010010001100

my $U = $U_prime + 0x1_0000;
sprintf '%d', $U; # 66700
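
As a cross-check, Perl's core Encode module decodes the same four octets straight back to the code point (a sketch; `decode` is the module's standard function):

use Encode qw(decode);

my $bytes = "\xD8\x01\xDC\x8C";                    # big-endian UTF-16 octets
printf "U+%04X\n", ord decode('UTF-16BE', $bytes); # U+1048C
printf "%d\n",     ord decode('UTF-16BE', $bytes); # 66700
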
daxim
  • For a beginner who is learning Unicode conversion this is very confusing and way too complex to understand. So, can you explain it in a simple way? Kinda like an answer on this question: https://stackoverflow.com/questions/58207814/how-utf-16-and-utf-8-conversion-happen?noredirect=1#comment102806489_58207814 – learningweb Oct 04 '19 at 10:05
  • Which part is confusing? You must tell me so I can improve my answer. – You can run the code yourself in a debugger or manually follow along each step with an electronic calculator. – daxim Oct 04 '19 at 10:38
  • Hey, thanks for taking your time to help, I really appreciate it. After dabbling in it for a while I finally figured it out myself but thanks for your time though it means a lot. (Have a nice day) – learningweb Oct 04 '19 at 15:42