How does decoding work in JavaScript's TextDecoder with Asian characters?

Question

    let uint8Array = new Uint8Array([228, 189, 160, 229, 165, 189]);

    alert( new TextDecoder().decode(uint8Array) ); // 你好

How does decoding these bytes end up producing Asian characters?

As far as I know, UTF-8 is 8-bit. But when I look at a UTF-8 charset map, I don't see any Asian characters up to 255.

Investigating the bits:

    [228, 189, 160, 229, 165, 189].map(i => parseInt(i).toString(2))
    // ["11100100", "10111101", "10100000", "11100101", "10100101", "10111101"]

    '你好'.split('').map((e,index) => '你好'.charCodeAt(index).toString(2) )
    // ["100111101100000", "101100101111101"]

Things that are a mystery to me:

The input has 48 bits in total, while the output has 30. Why?
Also, the bit patterns match in some places but not as a whole. For example, the 3rd and 6th elements of the input bit array match parts of the output bit array.

Is there something I am missing? Feel free to correct me.

Does this answer your question? [What is Unicode, UTF-8, UTF-16?](https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) — fdermishin, Jan 05 '21 at 20:00
Single-byte encoding for UTF-8 ends at 0x7f, not 0xff. In other words, the largest single-byte code is 127; beyond that it becomes a multi-byte encoding. — Keith, Jan 05 '21 at 20:00
Output has 32 bits, but the most significant bit is 0 for both symbols and is not displayed, so you see only 30 of them — fdermishin, Jan 05 '21 at 20:03

score 0 · Answer 1 · answered Jan 06 '21 at 03:23

I feel a bit dumb after asking this question.

After a little exploration of the UTF-8 RFC and Google, I found that my understanding of UTF-8 was wrong.

I thought that UTF-8 had a maximum of 8 bits per character, but that is wrong.

In reality, UTF-8 is a variable-length encoding that uses a minimum of 8 bits (one byte) per character. Characters with higher code points take up to 32 bits (four bytes).
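
To make this concrete for the bytes in the question, here is a minimal sketch that decodes them by hand. It assumes every character is a 3-byte sequence of the form `1110xxxx 10xxxxxx 10xxxxxx`, which holds for these two characters but not for UTF-8 in general, and it does no validation. `decodeThreeByteSequence` is a made-up helper name for illustration, not part of any API.

```javascript
// Hypothetical helper: decodes ONE 3-byte UTF-8 sequence into a code point.
// Assumes the bytes follow the pattern 1110xxxx 10xxxxxx 10xxxxxx.
function decodeThreeByteSequence(b1, b2, b3) {
  const high = b1 & 0b00001111; // strip the "1110" lead marker, keep 4 payload bits
  const mid = b2 & 0b00111111;  // strip the "10" continuation marker, keep 6 bits
  const low = b3 & 0b00111111;  // same for the last continuation byte
  return (high << 12) | (mid << 6) | low; // 4 + 6 + 6 = 16 payload bits
}

const bytes = [228, 189, 160, 229, 165, 189];
const codePoints = [];
for (let i = 0; i < bytes.length; i += 3) {
  codePoints.push(decodeThreeByteSequence(bytes[i], bytes[i + 1], bytes[i + 2]));
}

const bits = codePoints.map(cp => cp.toString(2));
const text = String.fromCodePoint(...codePoints);

console.log(bits); // ["100111101100000", "101100101111101"]
console.log(text); // 你好
```

So each character spends 8 of its 24 input bits on markers and carries 16 bits of payload, which accounts for 48 bits in versus 32 bits out (30 printed, because `toString(2)` drops the leading zero). It also shows why the 3rd and 6th bytes seemed to match: the low six bits of the last continuation byte are copied unchanged into the low six bits of the code point.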
