let uint8Array = new Uint8Array([228, 189, 160, 229, 165, 189]);

alert( new TextDecoder().decode(uint8Array) ); // 你好

How does decoding these bytes end up producing Asian characters?

As far as I know, UTF-8 is 8-bit. So if I look at a UTF-8 character map, I don't see any Asian characters up to 255.

Investigating the bits:

  1. Finding the bits for the input:
    [228, 189, 160, 229, 165, 189].map(i => i.toString(2))
    // ["11100100", "10111101", "10100000", "11100101", "10100101", "10111101"]
  2. Finding the bits for the output:
    [...'你好'].map(ch => ch.charCodeAt(0).toString(2))
    // ["100111101100000", "101100101111101"]

Things that are a mystery to me:

  1. The input has 48 bits in total while the output has 30. Why?
  2. The bit patterns match in some places but not as a whole. For example, the low bits of the 3rd and 6th input bytes match the trailing bits of the two output values.

Is there something I am missing? Feel free to correct me.

Lakshaya Sood
  • Does this answer your question? [What is Unicode, UTF-8, UTF-16?](https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) – fdermishin Jan 05 '21 at 20:00
  • Single-byte encoding for UTF-8 ends at 0x7f, not 0xff. In other words, the largest single-byte code is 127; after that it becomes a multi-byte encoding. – Keith Jan 05 '21 at 20:00
  • Output has 32 bits, but the most significant bit is 0 for both symbols and is not displayed, so you see only 30 of them – fdermishin Jan 05 '21 at 20:03
  • sorry the total bits in input is 48. Question updated – Lakshaya Sood Jan 06 '21 at 02:06

1 Answer

I feel a bit dumb after asking this question.

After a little exploration of the UTF-8 RFC and some searching, I found that my understanding of UTF-8 was wrong.

I thought that a UTF-8 character would take a maximum of 8 bits, but that is wrong.

In reality

UTF-8 is a variable-length encoding that uses a minimum of 8 bits (one byte) per character. Characters with higher code points take more bytes, up to 32 bits (four bytes).
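
To make the variable length visible, here is a small sketch that counts the UTF-8 bytes for a few sample characters using the standard `TextEncoder`. The helper name `utf8ByteLength` and the choice of samples are mine, purely for illustration.

```javascript
// TextEncoder always produces UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: number of UTF-8 bytes needed for one character.
function utf8ByteLength(char) {
  return encoder.encode(char).length;
}

// Samples written as escapes to avoid any ambiguity in the source file:
// U+0041 (A), U+00E9 (é), U+4F60 (你), U+1F600 (an emoji).
const samples = ['\u0041', '\u00E9', '\u4F60', '\u{1F600}'];
const byteLengths = samples.map(utf8ByteLength);

console.log(byteLengths); // [1, 2, 3, 4]
```

Only code points up to 127 fit in a single byte, which is why no Asian characters show up when looking at values below 255.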

This helped: How many characters can UTF-8 encode?
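
To tie this back to the bytes in the question, here is a rough sketch that decodes a three-byte UTF-8 sequence by hand. It assumes the input really is a well-formed three-byte sequence of the form `1110xxxx 10xxxxxx 10xxxxxx` and does no validation; the function name `decodeThreeByteSequence` is made up for this example.

```javascript
// Hypothetical helper: extract the code point from one three-byte
// UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx). No validation is done.
function decodeThreeByteSequence(b1, b2, b3) {
  const high = b1 & 0b00001111; // drop the 1110 marker, keep 4 payload bits
  const mid = b2 & 0b00111111;  // drop the 10 marker, keep 6 payload bits
  const low = b3 & 0b00111111;  // drop the 10 marker, keep 6 payload bits
  return (high << 12) | (mid << 6) | low; // 4 + 6 + 6 = 16 payload bits
}

const bytes = [228, 189, 160, 229, 165, 189];
const first = decodeThreeByteSequence(bytes[0], bytes[1], bytes[2]);
const second = decodeThreeByteSequence(bytes[3], bytes[4], bytes[5]);

console.log(first.toString(2));  // 100111101100000
console.log(second.toString(2)); // 101100101111101

const decoded = String.fromCodePoint(first, second);
console.log(decoded === new TextDecoder().decode(new Uint8Array(bytes))); // true
```

This accounts for both mysteries. Of the 48 input bits, 16 are marker bits and 32 are payload, and `toString(2)` drops the leading zero of each 16-bit value, leaving 15 + 15 = 30 printed bits. The low 6 bits of the 3rd and 6th bytes are copied unchanged into the lowest bits of each code point, which is why those parts of the patterns line up.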

Lakshaya Sood