javascript and string manipulation w/ utf-16 surrogate pairs

Question

I'm working on a twitter app and just stumbled into the world of utf-8(16). It seems the majority of javascript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide character aware.

I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.

function sortSurrogates(str){
  var cp = [];                 // array to hold code points
  while(str.length){           // loop till we've done the whole string
    if(/[\uD800-\uDFFF]/.test(str.substr(0,1))){ // test the first character
                               // High surrogate found low surrogate follows
      cp.push(str.substr(0,2)); // push the two onto array
      str = str.substr(2);     // clip the two off the string
    }else{                     // else BMP code point
      cp.push(str.substr(0,1)); // push one onto array
      str = str.substr(1);     // clip one from string 
    }
  }                            // loop
  return cp;                   // return the array
}

My question is, is there something simpler I'm missing? I see so many people reiterating that javascript deals with utf-16 natively, yet my testing leads me to believe that, while that may be the data format, the string functions don't know it yet. Am I missing something simple?

EDIT: To help illustrate the issue:

var a = "0123456789"; // U+0030 - U+0039 2 bytes each
var b = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; // U+1D7D8 - U+1D7E1 4 bytes each
alert(a.length); // javascript shows 10
alert(b.length); // javascript shows 20

Twitter sees and counts both of those as being 10 characters long.

Basic manipulation. Twitter doesn't return links inline, just plain text and urls and indices to where the urls belong. The indices are based on code points and not 16 bit characters. Also I have a textarea for formatting tweets. Javascript treats a simple character count as a count of 16 bit hunks rather than individual code points. I can work it out, just don't want to head off in the wrong direction without asking the pros if there isn't something simpler. — BentFX, Jul 30 '11 at 21:03
I'm sitting here mulling this over and I think I've got it, unless someone's got something simpler. With a little creative prototyping the arrays should very nearly plug into my existing code, and would also fit nicely in my function treasure chest. If the simplest way to deal with a mix of 2 and 4 byte characters is to parse them into arrays, then I just have to prototype the arrays to make them act more like strings. If no one jumps in with an elegant answer, I'll be back in a couple days with an answer that's almost 1/4 decent. — BentFX, Jul 30 '11 at 21:41
**Javascript uses UCS-2 internally, *which is not* UTF-16.** It is very difficult to handle Unicode in Javascript because of this, and I do not suggest attempting to do so. As for what Twitter does, you seem to be saying that it is sanely counting by code point not insanely by code unit. — tchrist, Jul 30 '11 at 21:43
@tchrist: What do you mean by that? JavaScript strings, which are what is visible to developers, are UTF-16 encoded. — Tim Down, Jul 30 '11 at 22:00
Yes! Thank you tchrist. After reading the wiki I wanted to say javascript was using ucs-2 but didn't know enough about it to feel confident in saying so. Yes! Twitter is counting code points. I've been thinking on it hard. It needs to be an object, that stores the string as an array of code points, with prototypes matching the main string manipulation functions. I think I can do this. :) — BentFX, Jul 30 '11 at 22:07
@Tim They are visible as UCS-2 strings of separate code units, not as Unicode strings of code points. You can prove this to yourself with regexes. Try writing `[𝒜-𝒵]` in a pattern and see what happens. It’s simply broken. If Javascript actually used UTF-16, I would be able to write `document.write(String.fromCharCode(0x1D49C))` and would not have to write **nor be allowed to write** `document.write(String.fromCharCode(0xD835,0xDC9C))` in its stead. This is broken UCS-2 nonsense. — tchrist, Jul 30 '11 at 22:08
@BentFX: I found [this recent bug report](https://processing-js.lighthouseapp.com/projects/41284/tickets/868), which seems related, but I don’t quite know what to make of it. — tchrist, Jul 30 '11 at 23:31
@tchrist I looked at that bug report and I get no joy. As I read it, the codePointAt(pos); function still needs pos defined in code units. — BentFX, Jul 31 '11 at 08:54

tchrist · Accepted Answer · 2011-07-30T22:33:37.380

24

Javascript uses UCS-2 internally, which is not UTF-16. It is very difficult to handle Unicode in Javascript because of this, and I do not suggest attempting to do so.

As for what Twitter does, you seem to be saying that it is sanely counting by code point not insanely by code unit.

Unless you have no choice, you should use a programming language that actually supports Unicode, and which has a code-point interface, not a code-unit interface. Javascript isn't good enough for that as you have discovered.

It has The UCS-2 Curse, which is even worse than The UTF-16 Curse, which is already bad enough. I talk about all this in my OSCON talk, Unicode Support Shootout: The Good, the Bad, & the (mostly) Ugly.

Due to its horrible Curse, you have to hand-simulate UTF-16 with UCS-2 in Javascript, which is simply nuts.
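To make the hand-simulation concrete, here is a minimal sketch of the `String.fromCharCode` behaviour discussed in the comments on the question. It assumes an engine whose `String.fromCharCode` applies the standard 16-bit truncation; the variable names are only illustrative.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument,
// so a supplementary code point such as U+1D49C is silently truncated.
var truncated = String.fromCharCode(0x1D49C);   // becomes "\uD49C"

// Writing the surrogate pair by hand yields the intended code point.
var pair = String.fromCharCode(0xD835, 0xDC9C); // U+1D49C as two code units

console.log(truncated.length);                     // 1
console.log(truncated.charCodeAt(0).toString(16)); // "d49c"
console.log(pair.length);                          // 2 code units, 1 code point
```

The pair has to be computed or written out by the programmer, which is the hand-simulation the answer complains about.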

Javascript suffers from all kinds of other terrible Unicode troubles, too. It has no support for graphemes or normalization or collation, all of which you really need. And its regexes are broken, sometimes due to the Curse, sometimes just because people got it wrong. For example, Javascript is incapable of expressing regexes like [𝒜-𝒵]. Javascript doesn’t even support casefolding, so you can’t write a pattern like /ΣΤΙΓΜΑΣ/i and have it correctly match στιγμας.

You can try to use the XRegExp plugin, but you won’t banish the Curse that way. Only changing to a language with Unicode support will do that, and Javascript just isn’t one of those.

edited Jul 30 '11 at 22:33

answered Jul 30 '11 at 22:05

I know nothing about graphemes or normalization but if I could simulate a rudimentary wide character aware string in javascript my current issue would be solved :) – BentFX Jul 30 '11 at 22:21
@BentFX: [This answer](http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp) suggests you will not easily be made happy. I am sorry. It appears that the ᴇᴄᴍᴀScript standard wickedly deﬁnes string values *not* as sequences of Unicode characters, but rather as sequences of 16-bit “code units”. This seems not to have been updated this side of the Millennium. We’ve known it takes 21 bits for a Unicode character for about 15 years now. If I find a way to do it, I’ll update my answer. – tchrist Jul 30 '11 at 22:27
@BentFX I fixed the links. I don’t have much good to say there I am afraid, because Javascript came out the worst of all seven languages I surveyed. As I said, I am really sorry about that; I can think of no reason they have dragged their feet on this for so very many years, and will update my answer if I find a solution. – tchrist Jul 30 '11 at 22:34
Yeah, I can't really see it from your perspective, but I've got a good idea where you're standing, and you've got a far better view of the landscape than I. I would be happy with a few functions to sensibly count characters and pick substrings based on the code points rather than 16 bit chunks. I have no hope of making regex or String.fromCharCode() work. Just want to be able to cut and paste coherently. – BentFX Jul 30 '11 at 22:43
@BentFX: I've knocked up code to do the basics of what you want. See my answer. – Tim Down Jul 31 '11 at 00:25
EcmaScript 5 says implementations can be either UTF-16 or UCS-2. "A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with **either UCS-2 or UTF-16** as the adopted encoding form, implementation level 3." from [chapter 2 paragraph 2](http://es5.github.com/#x2) – Mike Samuel Jul 31 '11 at 02:02
@Mike Please explain how to correctly interpret all Unicode code points from Unicode Version 3.0 or later **using UCS-2.** I do not believe that you can do so. – tchrist Jul 31 '11 at 02:40
@tchrist, I never claimed I could. I was merely pointing out that your first sentence is wrong : "Javascript uses UCS-2 internally..." – Mike Samuel Jul 31 '11 at 02:45
@Mike: Alright, then would you say that sometimes Javascript uses UCS-2 internally, and sometimes it uses UTF-16, and therefore you cannot rely on UTF-16 being there? – tchrist Jul 31 '11 at 02:47
@tchrist, I agree. If you want to work on many interpreters, you cannot rely on all of them representing supplemental codepoints as UTF-16. If you only need to work on one or a few interpreters, you can test: `var div = document.createElement("DIV"); div.innerHTML = "&#x10000;"; var isUtf16 = div.firstChild.nodeValue.charCodeAt(0) == 0xd800;` – Mike Samuel Jul 31 '11 at 03:06
@Mike: Excellent trick to know! I'll add it to my Unicode slides, because I'm trying to give every language some tidbits that help make Unicode less frustrating when using that language. – tchrist Jul 31 '11 at 14:27
`/ΣΤΙΓΜΑΣ/i.test('στιγμας')` returns `true` in my more or less up-to-date versions of Chrome, Firefox, Edge and even Internet Explorer. Current [ECMA-262 v9.0](https://www.ecma-international.org/ecma-262/9.0/index.html#sec-ecmascript-language-types-string-type) defines, that string has to use UTF-16. Also Current Javascript has `String.fromCodePoint` and `String.prototype.codePointAt` which actually work with codepoints above BMP. Maybe you could update your answer and mention, that modern javascript uses UTF-16? – T S Jul 05 '18 at 18:57
@tchrist Also [rumpel's answer](https://stackoverflow.com/a/37644635) shows that modern javascript can handle the questioned usecase internally, so you should mention, that your answer only applies, if you still need to support old javascript engines. – T S Jul 05 '18 at 19:09

Tim Down · Answer 2 · 2017-06-22T08:21:25.347

I've knocked together the starting point for a Unicode string handling object. It creates a function called UnicodeString() that accepts either a JavaScript string or an array of integers representing Unicode code points and provides length and codePoints properties and toString() and slice() methods. Adding regular expression support would be very complicated, but things like indexOf() and split() (without regex support) should be pretty easy to implement.

var UnicodeString = (function() {
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    function stringToCodePointArray(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    }

    function codePointArrayToString(codePoints) {
        var stringParts = [];
        for (var i = 0, len = codePoints.length, codePoint, offset, codePointCharCodes; i < len; ++i) {
            codePoint = codePoints[i];
            if (codePoint > 0xFFFF) {
                offset = codePoint - 0x10000;
                codePointCharCodes = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
            } else {
                codePointCharCodes = [codePoint];
            }
            stringParts.push(String.fromCharCode.apply(String, codePointCharCodes));
        }
        return stringParts.join("");
    }

    function UnicodeString(arg) {
        if (this instanceof UnicodeString) {
            this.codePoints = (typeof arg == "string") ? stringToCodePointArray(arg) : arg;
            this.length = this.codePoints.length;
        } else {
            return new UnicodeString(arg);
        }
    }

    UnicodeString.prototype = {
        slice: function(start, end) {
            return new UnicodeString(this.codePoints.slice(start, end));
        },

        toString: function() {
            return codePointArrayToString(this.codePoints);
        }
    };


    return UnicodeString;
})();

var ustr = UnicodeString("f𝌆𝌆bar");
document.getElementById("output").textContent = "String: '" + ustr + "', length: " + ustr.length + ", slice(2, 4): " + ustr.slice(2, 4);

<div id="output"></div>
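The answer notes that `indexOf()` should be easy to add once the string is held as an array of code points. The following is a rough sketch of that idea, not part of the original object: `toCodePoints` and `indexOfCodePoints` are hypothetical standalone helpers, written so the example runs without `UnicodeString`.

```javascript
// Hypothetical helper: decode a JavaScript string into an array of code points.
function toCodePoints(str) {
  var codePoints = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    if (unit >= 0xD800 && unit <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      // Well-formed surrogate pair: combine into one supplementary code point.
      codePoints.push(((unit - 0xD800) << 10) + (next - 0xDC00) + 0x10000);
      i++; // skip the low surrogate
    } else {
      codePoints.push(unit); // BMP code unit (or a lone surrogate, kept as-is)
    }
  }
  return codePoints;
}

// Hypothetical helper: position of `needle` in `haystack`, counted in code points.
function indexOfCodePoints(haystack, needle, fromIndex) {
  var start = fromIndex > 0 ? fromIndex : 0;
  for (var i = start; i + needle.length <= haystack.length; i++) {
    var j = 0;
    while (j < needle.length && haystack[i + j] === needle[j]) {
      j++;
    }
    if (j === needle.length) {
      return i;
    }
  }
  return -1;
}

var text = "f\uD834\uDF06\uD834\uDF06bar"; // "f", U+1D306 twice, then "bar"
console.log(indexOfCodePoints(toCodePoints(text), toCodePoints("bar"))); // 3
console.log(text.indexOf("bar"));                                        // 5 (code units)
```

The two results differ by the number of surrogate pairs that precede the match, which is the offset error that code-point indices (such as those Twitter returns) would otherwise cause.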

Thanks for the effort. I am really new to javascript object structure and will take a lot from your example. Really my coding is recreational and I do enjoy working out the puzzle. Seems like Unicode in Javascript is like a Rubik's Cube, except there are fewer correct solutions. :) — BentFX, Jul 31 '11 at 09:03
It's a sad statement on the current state of Javascript when a StackOverflow answer with 4 upvotes is the best way to handle UTF-16 Unicode. Great work on this though! Working perfectly for my current task (slicing Tweets containing Emoji icons) — Matt Vukas, Jul 11 '14 at 05:28

score 6 · Answer 3 · answered May 28 '12 at 07:28

Here are a couple scripts that might be helpful when dealing with surrogate pairs in JavaScript:

ES6 Unicode shims for ES3+ adds the String.fromCodePoint and String.prototype.codePointAt methods from ECMAScript 6. The ES3/5 fromCharCode and charCodeAt methods do not account for surrogate pairs and therefore give wrong results.
Full 21-bit Unicode code point matching in XRegExp with \u{10FFFF} allows matching any individual code point in XRegExp regexes.
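As a small illustration of the two methods named above, the sketch below feature-detects them and then calls the native versions. It assumes an ES2015-capable engine (or that a shim has already been loaded); `hasCodePointApi` and `results` are illustrative names only.

```javascript
// Feature-detect the ES2015 methods; a shim could be loaded when this is false.
var hasCodePointApi =
  typeof String.fromCodePoint === "function" &&
  typeof String.prototype.codePointAt === "function";

var results = null;
if (hasCodePointApi) {
  var scriptA = String.fromCodePoint(0x1D49C); // one code point, two code units
  results = {
    length: scriptA.length,          // 2
    atUnit0: scriptA.codePointAt(0), // 0x1D49C, the full code point
    atUnit1: scriptA.codePointAt(1)  // 0xDC9C, the lone low surrogate
  };
  console.log(results);
}
```

Note that the argument to `codePointAt` is still a code-unit position, so these methods help with decoding but do not by themselves give code-point indexing.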

score 5 · Answer 4 · answered Jun 05 '16 at 17:12

Javascript string iterators can give you whole code points (one string per character) instead of individual surrogate code units:

>>> [..."0123456789"]
["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"]
["𝟘", "𝟙", "𝟚", "𝟛", "𝟜", "𝟝", "𝟞", "𝟟", "𝟠", "𝟡"]
>>> [..."0123456789"].length
10
>>> [..."𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"].length
10
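Building on the iterator behaviour shown above, counting and slicing by code point can be sketched in a few lines. This assumes an ES2015 engine; `countCodePoints` and `sliceByCodePoints` are hypothetical helper names, not built-ins.

```javascript
// Hypothetical helper: number of code points, using the string iterator.
function countCodePoints(str) {
  let count = 0;
  for (const ch of str) { // iterates by code point, not by code unit
    count++;
  }
  return count;
}

// Hypothetical helper: slice with start/end given in code points.
function sliceByCodePoints(str, start, end) {
  return Array.from(str).slice(start, end).join("");
}

// U+1D7D8..U+1D7DB (double-struck digits 0 to 3), written as escapes.
const wide = "\uD835\uDFD8\uD835\uDFD9\uD835\uDFDA\uD835\uDFDB";
console.log(wide.length);           // 8 code units
console.log(countCodePoints(wide)); // 4 code points
console.log(sliceByCodePoints(wide, 1, 3) === "\uD835\uDFD9\uD835\uDFDA"); // true
```

Each call to `Array.from` walks the whole string, so for repeated slicing of long text it may be worth converting once and keeping the array.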

score 3 · Answer 5 · answered Jul 31 '11 at 12:59

This is along the lines of what I was looking for. It needs better support for the different string functions. As I add to it I will update this answer.

function wString(str){
  var T = this; //makes 'this' visible in functions
  T.cp = [];    //code point array
  T.length = 0; //length attribute
  T.wString = true; // (item.wString) tests for wString object

//member functions
  var sortSurrogates = function(s){  //returns array of utf-16 code points
    var chrs = [];
    while(s.length){             // loop till we've done the whole string
      if(/[\uD800-\uDFFF]/.test(s.substr(0,1))){ // test the first character
                                 // High surrogate found low surrogate follows
        chrs.push(s.substr(0,2)); // push the two onto array
        s = s.substr(2);         // clip the two off the string
      }else{                     // else BMP code point
        chrs.push(s.substr(0,1)); // push one onto array
        s = s.substr(1);         // clip one from string 
      }
    }                            // loop
    return chrs;
  };
//end member functions

//prototype functions
  T.substr = function(start,len){
    if(len){
      return T.cp.slice(start,start+len).join('');
    }else{
      return T.cp.slice(start).join('');
    }
  };

  T.substring = function(start,end){
    return T.cp.slice(start,end).join('');
  };

  T.replace = function(target,str){
    //allow wStrings as parameters
    if(str.wString) str = str.cp.join('');
    if(target.wString) target = target.cp.join('');
    return T.toString().replace(target,str);
  };

  T.equals = function(s){
    if(!s.wString){
      s = sortSurrogates(s);
      T.cp = s;
    }else{
        T.cp = s.cp;
    }
    T.length = T.cp.length;
  };

  T.toString = function(){return T.cp.join('');};
//end prototype functions

  T.equals(str)
};

Test results:

// plain string
var x = "0123456789";
alert(x);                    // 0123456789
alert(x.substr(4,5))         // 45678
alert(x.substring(2,4))      // 23
alert(x.replace("456","x")); // 0123x789
alert(x.length);             // 10

// wString object
x = new wString("𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡");
alert(x);                    // 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
alert(x.substr(4,5))         // 𝟜𝟝𝟞𝟟𝟠
alert(x.substring(2,4))      // 𝟚𝟛
alert(x.replace("𝟜𝟝𝟞","x")); // 𝟘𝟙𝟚𝟛x𝟟𝟠𝟡
alert(x.length);             // 10

@Tim Yeah, the structure's different, but the main thing is prototyping the needed functions. The main difference is you encode the code units into code points. I choose not to do that because I have no use for the true code points, javascript can't display them, so why bother. For my uses it is enough just to separate them so they can be counted and split at reasonable points. Have Fun! — BentFX, Jul 31 '11 at 14:28
Fair enough, if you don't need the real code points then don't use them. You may want them if you needed to send a Unicode string to the server. — Tim Down, Jul 31 '11 at 14:54
From angular js: "...".replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, function(value) { var hi = value.charCodeAt(0); var low = value.charCodeAt(1); return '&#' + (((hi - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000) + ';'; }) This is creating an entity encoded value safe to insert into attributes or element bodies. — Ajax, Jan 04 '16 at 09:31
