17

The following doesn't seem correct:

"🚀".charCodeAt(0);  // returns 55357 in both Firefox and Chrome

That's the Unicode character named ROCKET (U+1F680); its decimal value should be 128640.

This is for a Unicode app I am writing. It seems that most, but not all, characters from Unicode 6 are stuck at 55357.

how can I fix it? Thanks.

Xah Lee

3 Answers

10

JavaScript uses UTF-16 encoding; see this article for details:

Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character.

The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate.

The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.
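The two ranges above can be checked directly against what charCodeAt returns. A minimal sketch; isHighSurrogate and isLowSurrogate are hypothetical helper names, not built-in functions:

```javascript
// Hypothetical helpers; the ranges are the ones quoted above.
function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}

// U+1F680 ROCKET, written as its two UTF-16 code units.
var rocket = "\uD83D\uDE80";

console.log(isHighSurrogate(rocket.charCodeAt(0))); // true
console.log(isLowSurrogate(rocket.charCodeAt(1)));  // true
console.log(isHighSurrogate("A".charCodeAt(0)));    // false
```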

You can decode the surrogate pair like this:

codePoint = (text.charCodeAt(0) - 0xD800) * 0x400 + text.charCodeAt(1) - 0xDC00 + 0x10000

The complete code can be found in the Mozilla documentation for charCodeAt.
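For reference, the formula can be wrapped so that it also handles characters that are not part of a surrogate pair. This is only a rough sketch, not the Mozilla code; fixedCodePointAt is a hypothetical name:

```javascript
// Hypothetical helper, not part of any library.
function fixedCodePointAt(text, index) {
  var first = text.charCodeAt(index);

  // Not a high surrogate: the code unit already is the code point
  // (or NaN when index is out of range).
  if (!(first >= 0xD800 && first <= 0xDBFF)) {
    return first;
  }

  var second = text.charCodeAt(index + 1);

  // High surrogate with no low surrogate after it: return it unchanged.
  if (!(second >= 0xDC00 && second <= 0xDFFF)) {
    return first;
  }

  // Same formula as above.
  return (first - 0xD800) * 0x400 + (second - 0xDC00) + 0x10000;
}

console.log(fixedCodePointAt("\uD83D\uDE80", 0)); // 128640 (U+1F680 ROCKET)
console.log(fixedCodePointAt("A", 0));            // 65
```

Note that calling it with the index of the second half of a pair simply returns that low surrogate (56960 for the rocket), so the caller has to skip one extra position after each pair.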

Daniel
  • All great answers. Josh Lee's link to https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/String/charCodeAt contains code that fixes the problem. – Xah Lee Mar 04 '13 at 03:16
  • Daniel, would you consider adding that Mozilla link? as it contains working code. thanks. – Xah Lee Mar 04 '13 at 03:18
5

Tried this out:

> "🚀".charCodeAt(0);
55357

> "🚀".charCodeAt(1);
56960
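Those two values are the high and low surrogate of the pair. As a quick check, plugging them into the usual surrogate-pair arithmetic gives back the expected code point (the variable names are just for illustration):

```javascript
var high = 55357; // 0xD83D, from charCodeAt(0)
var low = 56960;  // 0xDE80, from charCodeAt(1)

// (61 * 1024) + 640 + 65536
var codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;

console.log(codePoint);                            // 128640
console.log(codePoint.toString(16).toUpperCase()); // "1F680"
```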

Related questions on SO:

You might want to take a look at this too:

Community
Samuel Liew
0

I think it's because they're returning the first code unit of the UTF-16 encoding of that character. I'm not sure there's much you can do, because charCodeAt returns a 16-bit value. I would probably try manually decoding the character from the first two code units and then encoding it in UTF-32, which seems to be what you want.
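Something along these lines might work for the manual decoding: walk the string, combine each surrogate pair, and store the results as 32-bit values. This is a sketch under the assumption that typed arrays are available; toCodePoints is a hypothetical helper name:

```javascript
// Hypothetical helper: returns one 32-bit value per character (UTF-32 style).
function toCodePoints(text) {
  var out = [];
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    var next = text.charCodeAt(i + 1); // NaN at the end of the string

    if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      // Valid surrogate pair: combine both code units into one code point.
      out.push((unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000);
      i++; // skip the low surrogate that was just consumed
    } else {
      // BMP character or unpaired surrogate: keep the code unit as-is.
      out.push(unit);
    }
  }
  return new Uint32Array(out);
}

var points = toCodePoints("a\uD83D\uDE80b"); // "a", ROCKET, "b"
console.log(points.length); // 3
console.log(points[1]);     // 128640
```

In engines that support ES2015, String.prototype.codePointAt does the pairing natively, so "\uD83D\uDE80".codePointAt(0) returns 128640 without any manual arithmetic.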

user541686