Expressing UTF-16 unicode characters in JavaScript

Question

To express, for example, the character U+10400 in JavaScript, I use "\uD801\uDC00" or String.fromCharCode(0xD801) + String.fromCharCode(0xDC00). How do I work out those two values for a given Unicode code point? I want the following:

var char = getUnicodeCharacter(0x10400);

How do I find 0xD801 and 0xDC00 from 0x10400?

I can't believe that this many years later Javascript is still in the Stone Age regarding Unicode. Having only BMP characters was something that should have gone out the door with Unicode 1.1 something like 15 years ago. Why is Javascript still so broken? — tchrist, Aug 20 '11 at 03:48
@tchrist: because you can't change a language's basic string model without widespread application breakage. Java, .NET and Windows in general are in the same boat: most of the world is afflicted by the UTF-16 curse. Browser JavaScript has a further hurdle in that the DOM standard also requires strings to be indexed by UTF-16 code units. — bobince, Aug 20 '11 at 10:06
@bobince: I agree that the UTF-16 Curse sucks, but it may not be insurmountable. There are still measures that can be taken. You can provide alternate libraries available by explicit declaration that have a code point interface sitting on top the original code unit one. On the other hand, the UCS-2 that afflicts Javascript and many aspects of narrow builds of Python is a scourge, and some of the JVM languages can't make use of the code point interfaces that Java is able to provide if you ask nicely enough. — tchrist, Aug 20 '11 at 10:29
`String.fromCharCode(0xD801) + String.fromCharCode(0xDC00)` can be written as `String.fromCharCode(0xD801, 0xDC00)`. — Mathias Bynens, Feb 02 '12 at 13:08
See the [wikipedia article on UTF-16](http://en.wikipedia.org/wiki/UTF-16). — hmakholm left over Monica, Aug 19 '11 at 19:24
[Formulas to convert between Unicode code points and surrogate pairs](http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae) — Mathias Bynens, Aug 08 '14 at 13:46

Arnaud Le Blanc · Accepted Answer · 2011-08-19T21:24:49.263


Based on the Wikipedia article given by Henning Makholm, the following function returns the character for a given code point:

function getUnicodeCharacter(cp) {

    if (cp >= 0 && cp <= 0xD7FF || cp >= 0xE000 && cp <= 0xFFFF) {
        // BMP code point outside the surrogate range: a single code unit
        return String.fromCharCode(cp);
    } else if (cp >= 0x10000 && cp <= 0x10FFFF) {

        // we subtract 0x10000 from cp to get a 20-bit number
        // in the range 0..0xFFFFF
        cp -= 0x10000;

        // we add 0xD800 to the number formed by the high 10 bits
        // to give the first code unit (the high surrogate)
        var first = ((0xffc00 & cp) >> 10) + 0xD800;

        // we add 0xDC00 to the number formed by the low 10 bits
        // to give the second code unit (the low surrogate)
        var second = (0x3ff & cp) + 0xDC00;

        return String.fromCharCode(first) + String.fromCharCode(second);
    }
    // any other input (a surrogate code point or an out-of-range value)
    // falls through and returns undefined
}
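As a worked check of the arithmetic above, here is a minimal sketch that splits U+10400 into its two code units. The helper name `toSurrogatePair` is hypothetical and is not part of the answer; it only isolates the subtract, shift and mask steps.

```javascript
// Hypothetical helper (not from the answer): split a supplementary
// code point into its two UTF-16 code units.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError('expected a code point in 0x10000..0x10FFFF');
  }
  var offset = cp - 0x10000;            // 20-bit value in 0..0xFFFFF
  var high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

// 0x10400 - 0x10000 = 0x400
// 0x400 >> 10       = 1  -> 0xD800 + 1 = 0xD801
// 0x400 & 0x3FF     = 0  -> 0xDC00 + 0 = 0xDC00
var pair = toSurrogatePair(0x10400);
var ch = String.fromCharCode(pair[0], pair[1]);
```

In engines that implement ES2015, `String.fromCodePoint(0x10400)` should produce the same two-unit string without any manual arithmetic.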

edited Aug 19 '11 at 21:24

answered Aug 19 '11 at 19:49


You can't concatenate `"\u"` with a hex code to get a unicode character. That is the literal syntax. To get a string from a code you must use `String.fromCharCode()`. This will return false: `"\u0001" == "\u"+"0001"` so will this: `"\u0001" == "\\u"+"0001"`. – gilly3 Aug 19 '11 at 20:55

Well, I know :) The function purposefully returned the javascript literal for those code points (so, `"\uD801\uDC00"` for `0x10400`). I modified the function to return the character instead. – Arnaud Le Blanc Aug 19 '11 at 21:25

Mathias Bynens · Answer 2 · 2012-06-04T09:28:22.413

How do I find 0xD801 and 0xDC00 from 0x10400?

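JavaScript exposes strings as sequences of 16-bit code units, UCS-2 style, so a non-BMP character occupies two positions in the string. That’s why String#charCodeAt() doesn’t work the way you’d want it to: for such a character it returns one surrogate half rather than the full code point.

If you want to get the code point of every Unicode character (including non-BMP characters) in a string, you could use Punycode.js’s utility functions to convert between UCS-2 strings and Unicode code points: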

// String#charCodeAt() replacement that only considers full Unicode characters
punycode.ucs2.decode('𝌆'); // [119558]
punycode.ucs2.decode('abc'); // [97, 98, 99]
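
For readers who want to see what such a decode step involves, the following is a rough sketch of the idea, not Punycode.js's actual implementation. The function name `decodeCodePoints` is made up for illustration, and the handling of unpaired surrogates (passing them through unchanged) is an assumption.

```javascript
// Hypothetical stand-in for a ucs2.decode-style function: walk the
// UTF-16 code units and merge each surrogate pair into one code point.
function decodeCodePoints(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out.push(((unit - 0xD800) << 10) + (next - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    // BMP code unit, or an unpaired surrogate (assumption: keep it as-is).
    out.push(unit);
  }
  return out;
}

var decodedAstral = decodeCodePoints('\uD834\uDF06'); // U+1D306
var decodedAscii = decodeCodePoints('abc');
```

This is the inverse of the surrogate-pair formula: subtract the surrogate bases, recombine the two 10-bit halves, then add 0x10000 back. In ES2015 engines, `String.prototype.codePointAt` performs the same pairing at a given index.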

If you don’t need to do it programmatically though, and you’ve already got the character, just use mothereff.in/js-escapes. It will tell you how to escape any character in JavaScript.
