Please look at this script, which operates on a (theoretically possible) string:

<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title></title>
<script src="jquery.js"></script>
<script>
    $(function () {
        $("#click").click(function () {
            var txt = $('#high-unicode').text();
            var codes = '';
            for (var i = 0; i < txt.length; i++) {
                if (i > 0) codes += ',';
                codes += txt.charCodeAt(i);
            }
            alert(codes);
        });
    });
</script>
</head>
<body>
<span id="click">click</span><br />
<span id="high-unicode">&#x1D465;<!-- mathematical italic small x -->&#xF31E0;<!-- some char from Supplementary Private Use Area-A -->A<!-- char A -->&#x108171;<!-- some char from Supplementary Private Use Area-B --></span>
</body>
</html>

Instead of "55349,56421,56204,56800,65,56288,56689", is it possible to get "119909,995808,65,1081713"? I've read "more-utf-32-aware-javascript-string" as well as "Q: What’s the algorithm to convert from UTF-16 to character codes?" and "Q: Isn’t there a simpler way to do this?" from unicode.org/faq/utf_bom, but I'm not sure how to apply this information.
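
If I understand the FAQ correctly, its formula, applied by hand to the first surrogate pair from the alert above, should give the first value I'm after:

// U+1D465 is stored as the surrogate pair 0xD835 (55349), 0xDC65 (56421)
// FAQ formula: codePoint = (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000
var high = 0xD835, low = 0xDC65;
console.log((high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000); // 119909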

lyrically wicked
    I am very disappointed to learn that JavaScript seems to encode its strings with UTF-16. I didn't know that. +1 to your question for teaching me that. I don't understand why anyone or anything would want to use UTF-16 for pretty much any purpose. It's a variable-width encoding with all the disadvantages that entails, it's not self-synchronizing like UTF-8 is, it's not ASCII-compatible like UTF-8 is, etc... It has all the worst properties! – Celada Feb 04 '13 at 04:39

1 Answer

It looks like you have to decode surrogate pairs manually. For example:

function decodeUnicode(str) {
    var r = [], i = 0;
    while(i < str.length) {
        var chr = str.charCodeAt(i++);
        if(chr >= 0xD800 && chr <= 0xDBFF) {
            // high surrogate: combine it with the following low surrogate
            var low = str.charCodeAt(i++);
            // code point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            r.push(0x10000 + (((chr - 0xD800) << 10) | (low - 0xDC00)));
        } else {
            // ordinary (BMP) character
            r.push(chr);
        }
    }
    return r;
}

Complete code: http://jsfiddle.net/twQWU/
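
As a quick sanity check, calling the function on the question's string (written here with explicit surrogate escapes) should print the expected code points:

// U+1D465, U+F31E0, "A", U+108171 as UTF-16 surrogate escapes
var s = "\uD835\uDC65\uDB8C\uDDE0A\uDBE0\uDD71";
console.log(decodeUnicode(s)); // [119909, 995808, 65, 1081713]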

georg
  • `const str = '𝑥'; str.codePointAt(0);` does return `119909` for me. This is actually similar to Java, which also uses UTF-16 internally. Especially in situations where you need to convert text between different encodings, it is probably best to represent the text entirely as a series of integers (code points) and then use `String.fromCodePoint(codePoint)` to convert it back to text form – Roman Vottner May 05 '21 at 02:50
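
Along the lines of that comment, on ES2015+ engines the same result can be obtained without handling surrogates by hand, since string iteration (for...of, Array.from) walks the string by code point; a minimal sketch:

// Iterate by code point rather than by UTF-16 code unit (ES2015+)
function codePoints(str) {
    return Array.from(str, ch => ch.codePointAt(0));
}

codePoints("\uD835\uDC65\uDB8C\uDDE0A\uDBE0\uDD71"); // [119909, 995808, 65, 1081713]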