How do we get from the two 16-bit code points used to represent a non-BMP character in UTF-16 to the single code point of the character in Unicode?

Question

In ES6, when we use codePointAt(0) on a string with one character in it ('𠮷') that has a Unicode code point value larger than U+FFFF (and is therefore not part of the Basic Multilingual Plane), we get the code point 134071. The string still actually has two code points in it that together represent this 134071 value.

> (55362).toString(16)
'd842'
> (57271).toString(16)
'dfb7'
> "\ud842\udfb7"
'𠮷'
> const j = "\ud842\udfb7"
undefined
> j
'𠮷'
> j.codePointAt(0)
134071
> j.codePointAt(1)
57271
>

My question is how do we go from the two code points 55362 and 57271 to the single code point 134071. I am talking about the mathematical relationship here.

Also, why can we still get access to the code point at position 1, but we can't get access to the individual code point at position 0?

@gman this question is not answered by the question you linked. You closed this question mistakenly. — evianpring, Dec 04 '19 at 07:13
this is the duplicate: https://stackoverflow.com/questions/8868432/how-are-surrogate-pairs-calculated — evianpring, Dec 04 '19 at 07:17
explanation of the UTF-16 algorithm with an example, in both directions: https://stackoverflow.com/a/58215052/46395 — daxim, Dec 04 '19 at 11:05
@evianpring You are getting your terminology wrong. A string contains [UTF-16](https://en.wikipedia.org/wiki/UTF-16) *codeunits*, not Unicode *codepoints*. Codepoints outside the BMP are represented as *surrogate pairs*. The string's `codePointAt()` method looks at the codeunit at the given index, and if it begins a surrogate pair then the whole pair is decoded, otherwise the codeunit is returned as-is. This is documented behavior — Remy Lebeau, Dec 04 '19 at 18:24
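
To make the decoding described in the comment above concrete, here is a minimal sketch of the arithmetic. The helper names `decodeSurrogatePair` and `codePointAtSketch` are hypothetical and only approximate what `codePointAt()` does; in particular, out-of-range indexes are not handled the way the built-in handles them. The idea is that each surrogate carries 10 bits of payload: subtract 0xD800 from the high surrogate and 0xDC00 from the low surrogate, combine the two 10-bit values, and add 0x10000.

```javascript
// Sketch only: both function names are hypothetical, not part of any standard API.
function decodeSurrogatePair(high, low) {
  // high is expected in 0xD800..0xDBFF, low in 0xDC00..0xDFFF.
  // Each contributes 10 bits; the result is offset by 0x10000.
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

function codePointAtSketch(str, index) {
  const first = str.charCodeAt(index);
  // Only a high surrogate followed by a low surrogate is decoded as a pair.
  if (first >= 0xD800 && first <= 0xDBFF && index + 1 < str.length) {
    const second = str.charCodeAt(index + 1);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      return decodeSurrogatePair(first, second);
    }
  }
  // Otherwise the code unit is returned as-is.
  return first;
}

const j = "\ud842\udfb7";
console.log(decodeSurrogatePair(55362, 57271)); // 134071
console.log(codePointAtSketch(j, 0)); // 134071
console.log(codePointAtSketch(j, 1)); // 57271
```

With the values from the question: (55362 - 55296) * 1024 + (57271 - 56320) + 65536 = 67584 + 951 + 65536 = 134071. Index 1 points at a low surrogate, which does not begin a pair, so it comes back unchanged as 57271.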
