6

I have written a personal web app that uses charCodeAt() to convert text that is input by the user into the relevant character codes (for example, ⊇ is converted to 8839 for storage). These numbers are then sent to Perl, which sends them on to MySQL. To retrieve the input text, the app uses fromCharCode() to convert the numbers back to text.
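Roughly, the conversion looks like this (a simplified sketch; the helper names are mine, not the app's actual code):

// Text -> array of numeric character codes, one per call to charCodeAt()
function textToCodes(text) {
    var codes = [];
    for (var i = 0; i < text.length; i++) {
        codes.push(text.charCodeAt(i));
    }
    return codes;
}

// Array of numeric codes -> text, via fromCharCode()
function codesToText(codes) {
    return String.fromCharCode.apply(null, codes);
}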

I chose to do this because Perl's unicode support is very hard to deal with correctly. So Perl and MySQL only see numbers, which makes life a lot simpler.

My question is: can I depend on fromCharCode() to always convert a number like 8834 back to the same character? I don't know what standard it uses, but suppose it uses UTF-8; if it is changed to use UTF-16 in the future, that will obviously break my program unless there is backward compatibility.

I know that my ideas about these concepts aren't that clear, so please correct me where I've shown a misunderstanding.

Ikram Hawramani
  • 157
  • 1
  • 8
  • You can also use `escape` / `unescape` or `encodeURIComponent` / `decodeURIComponent` to encode and decode these data. – Thai Jun 05 '11 at 10:37

6 Answers

9

fromCharCode and charCodeAt deal with Unicode code points, i.e. numbers between 0 and 65535 (0xffff), assuming all characters are in the Basic Multilingual Plane (BMP). Unicode and its code point assignments are permanent, so you can trust them to remain the same forever.

Encodings such as UTF-8 and UTF-16 take a stream of code points (numbers) and output a byte stream. JavaScript is somewhat strange in that characters outside the BMP have to be constructed with two calls to fromCharCode (and read back as two values from charCodeAt), according to UTF-16's surrogate-pair rules. However, virtually every character you'll ever encounter (including Chinese, Japanese, etc.) is in the BMP, so your program will work even if you don't handle these cases.
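To illustrate (a small sketch; the non-BMP character is written with explicit \u escapes so it survives any editor):

// A BMP character round-trips through a single code unit:
"⊇".charCodeAt(0);             // 8839
String.fromCharCode(8839);     // "⊇"

// A character outside the BMP occupies two code units (a surrogate pair):
var s = "\uD834\uDF06";        // U+1D306 TETRAGRAM FOR CENTRE
s.length;                      // 2
s.charCodeAt(0).toString(16);  // "d834"
s.charCodeAt(1).toString(16);  // "df06"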

One thing you can do is convert the numbers back into bytes (in big-endian int16 format) and interpret the resulting byte stream as UTF-16. The behavior of fromCharCode and charCodeAt is fixed by the ECMAScript standard and will not change.
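For example, a minimal sketch of that byte conversion (the function name is illustrative, not from any library):

// Convert an array of 16-bit values (as produced by charCodeAt) to
// big-endian bytes, i.e. a UTF-16BE byte stream
function codeUnitsToUTF16BE(codes) {
    var bytes = [];
    for (var i = 0; i < codes.length; i++) {
        bytes.push((codes[i] >> 8) & 0xFF); // high byte first
        bytes.push(codes[i] & 0xFF);        // then low byte
    }
    return bytes;
}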

phihag
  • 278,196
  • 72
  • 453
  • 469
  • thanks! What happens if a user inputs something that is not in that 'Basic-Multilingual Plane'? Is `toCharCode` unable to deal with them? – Ikram Hawramani Jun 05 '11 at 10:09
  • @Hawramani No, it just returns two strange-looking values that are not Unicode code points on their own (between `0xd800` and `0xdbff` for the first one, `0xdc00` and `0xdfff` for the second one). Updated the answer. – phihag Jun 05 '11 at 10:13
  • 2
    So `fromCharCode` and `toCharCode` obviously don't deal with code *points*, but code *units*. That means that you have to deal with individual sequences of code units, i.e. convert them to scalar values on either the JavaScript or the Perl side. – Philipp Jun 05 '11 at 12:20
  • 2
    @Hawramani Yes that’s right. Instead of writing `document.write(String.fromCharCode(0x1D49C))` with one Unicode character, you must manually emit two UTF-16 code units that, when assembled on the other side, will become the correct thing. For example, `document.write(String.fromCharCode(0xD835,0xDC9C))`. Very nasty. – tchrist Jul 30 '11 at 22:31
5

I chose to do this because Perl's unicode support is very hard to deal with correctly.

This is ɴᴏᴛ true!

Perl has the strongest Unicode support of any major programming language. It is much easier to work with Unicode if you use Perl than if you use any of C, C++, Java, C#, Python, Ruby, PHP, or Javascript. This is not hyperbole or boosterism from uneducated, blind allegiance; it is a considered appraisal based on more than ten years of professional experience and study.

The problems encountered by naïve users are virtually always because they have deceived themselves about what Unicode is. The number-one worst brain-bug is thinking that Unicode is like ASCII but bigger. This is absolutely and completely wrong. As I wrote elsewhere:

It’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s true of nothing more than the stultified ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much much more than just the assignment of numbers to glyphs: rules for collation and comparisons, three forms of casing, non-letter casing, multi-codepoint casefolding, both canonical and compatible composed and decomposed normalization forms, serialization forms, grapheme clusters, word- and line-breaking, scripts, numeric equivalents, widths, bidirectionality, mirroring, print widths, logical ordering exclusions, glyph variants, contextual behavior, locales, regexes, multiple forms of combining classes, multiple types of decompositions, hundreds and hundreds of critically useful properties, and much much much more‼
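To make just one of those concepts concrete, here is canonical equivalence in JavaScript (using String.prototype.normalize, a method added to the language years after this answer was written):

// These two strings render identically as "é" but contain different code points:
var composed = "\u00E9";    // U+00E9 LATIN SMALL LETTER E WITH ACUTE
var decomposed = "e\u0301"; // "e" + U+0301 COMBINING ACUTE ACCENT

composed === decomposed;                  // false
composed === decomposed.normalize("NFC"); // true
composed.normalize("NFD") === decomposed; // true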

Yes, that’s a lot, but it has nothing to do with Perl. It has to do with Unicode. That Perl allows you to access these things when you work with Unicode is not a bug but a feature. That those other languages do not allow you full access to Unicode can by no means be construed as a point in their favor: rather, those are all major bugs of the highest possible severity, because if you cannot work with Unicode in the 21st century, then that language is primitive, broken, and fundamentally useless for the demanding requirements of modern text processing.

Perl is not. And it is a gazillion times easier to do those things right in Perl than in those other languages; in most of them, you cannot even begin to work around their design flaws. You’re just plain screwed. If a language doesn’t provide full Unicode support, it is not fit for this century; discard it.

Perl makes Unicode infinitely easier than languages that don’t let you use Unicode properly ever can.

In this answer, you will find, at the front, Seven Simple Steps for dealing with Unicode in Perl, and, at the bottom of that same answer, some boilerplate code that will help. Understand it, then use it. Do not accept brokenness. You have to learn Unicode before you can use Unicode.

And that is why there is no simple answer. Perl makes it easy to work with Unicode, provided that you understand what Unicode really is. And if you’re dealing with external sources, you are going to have to arrange for that source to use some sort of encoding.

Also read up on all the stuff I said in that answer about assuming brokenness. Those are things that you truly need to understand. Another brokenness issue that falls out of Rule #49 is that Javascript is broken because it doesn’t treat all valid Unicode code points in exactly the same way irrespective of their plane. Javascript is broken in almost all the other ways, too. It is unsuitable for Unicode work. Just Rule #34 will kill you, since you can’t get Javascript to follow the required standard about what things like \w are defined to do in Unicode regexes.
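For example (easy to verify in a JavaScript console; the bare \w class is still ASCII-only today):

/\w/.test("A");  // true  - \w matches only [A-Za-z0-9_]
/\w/.test("é");  // false - a perfectly good Unicode letter fails
/\w/.test("字"); // false - ditto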

It’s amazing how many languages are utterly useless for Unicode. But Perl is most definitely not one of those!

tchrist
  • 78,834
  • 30
  • 123
  • 180
  • Thanks for the answer. It was that answer that you quote which made me not want to deal with Perl Unicode. For what I needed (saving occasional math characters in a database), it seemed simpler to just send the data as arrays of numbers instead of dealing with the complexity of Perl and MySQL regarding Unicode. – Ikram Hawramani Jun 05 '11 at 17:17
  • 1
    And by the way, Perl is one of my favorite languages. You can see that my avatar is a picture of the cover of Higher Order Perl. :) – Ikram Hawramani Jun 05 '11 at 17:25
  • 2
    Somewhat hardline, this answer. I accept JavaScript is not ideal for Unicode work, but there's not really an alternative for client-side scripting so we have to make the best of it. +1 nonetheless. – Tim Down Jun 05 '11 at 23:35
4

In my opinion, it won't break.

Read Joel Spolsky's article on Unicode and character encodings. The relevant part of the article is quoted below:

Every letter in every alphabet is assigned a number by the Unicode consortium which is written like this: U+0639. This number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. The English letter A would be U+0041.

It does not matter whether this magical number is encoded in UTF-8 or UTF-16 or any other encoding. The number will still be the same.
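A quick check of this in JavaScript, using the same examples as the quote:

"A".charCodeAt(0);           // 65, i.e. 0x41 - the code point of U+0041
String.fromCharCode(0x639);  // "ع" - ARABIC LETTER AIN, Joel's U+0639 example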

Ozair Kafray
  • 13,351
  • 8
  • 59
  • 84
  • 1
    But `charCodeAt` doesn't give you a code point. See e.g. https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/String/charCodeAt – Philipp Jun 05 '11 at 12:27
  • @Philipp: Thanks for pointing that out. In that case it will be a problem. – Ozair Kafray Jun 05 '11 at 13:16
4

As pointed out in other answers, fromCharCode() and charCodeAt() deal with Unicode code points for any code point in the Basic Multilingual Plane (BMP). Strings in JavaScript are UCS-2 encoded, and any code point outside the BMP is represented as two JavaScript characters. None of these things are going to change.

To handle any Unicode character on the JavaScript side, you can use the following function, which will return an array of numbers representing the sequence of Unicode code points for the specified string:

var getStringCodePoints = (function() {
    // Combine a surrogate pair (two UTF-16 code units) into one code point:
    // take the low 10 bits of each unit and add the 0x10000 offset
    function surrogatePairToCodePoint(charCode1, charCode2) {
        return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
    }

    // Read string in character by character and create an array of code points
    return function(str) {
        var codePoints = [], i = 0, charCode;
        while (i < str.length) {
            charCode = str.charCodeAt(i);
            if ((charCode & 0xF800) == 0xD800) {
                // charCode is a surrogate (0xD800-0xDFFF): consume the next
                // code unit as well and combine the two into one code point
                codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
            } else {
                codePoints.push(charCode);
            }
            ++i;
        }
        return codePoints;
    }
})();

var str = "\uD834\uDF06"; // U+1D306 TETRAGRAM FOR CENTRE, a character outside the BMP
var codePoints = getStringCodePoints(str);

console.log(str.length); // 2
console.log(codePoints.length); // 1
console.log(codePoints[0].toString(16)); // 1d306
Tim Down
  • 318,141
  • 75
  • 454
  • 536
  • Thanks for the answer. So far I've had no experience with shifting bits (in this context or any other) and would love to get a good understanding of them, can you recommend any books/resources that deal with these techniques? – Ikram Hawramani Jun 05 '11 at 17:22
3

JavaScript strings are UTF-16; this isn't something that is going to change.

But don't forget that UTF-16 is a variable-length encoding.
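A quick illustration of what "variable length" means here (the non-BMP character is written with explicit escapes):

var bmp = "A";               // a character inside the BMP
var astral = "\uD834\uDF06"; // U+1D306, outside the BMP

bmp.length;    // 1 - one 16-bit code unit
astral.length; // 2 - two code units for a single character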

Artyom
  • 31,019
  • 21
  • 127
  • 215
  • What does variable length encoding mean, and will it have an effect on my app which uses those two functions to convert letters back and forth? – Ikram Hawramani Jun 05 '11 at 10:10
  • @Hawramani: That means that each scalar value is represented by a variable number of 16-bit code units. Yes, this is something you have to deal with because you no longer work with code points, but with code units. Not that hard though, just make it explicit in the Perl script that you have a sequence of UTF-16 code units, not a sequence of Unicode code points. – Philipp Jun 05 '11 at 12:26
  • @Philipp: Hmm. What my program does is it converts the values in an input into an array. For example if the user enters the word 'Programming', it becomes this array: `[80,114,111,103,114,97,109,109,105,110,103]`, this is then sent to Perl as a string, which sends it to MySQL without any processing. MySQL sees it only as a string too. All of the unicode processing is done in JavaScript, Perl and MySQL could be using ASCII and it would still work. To me it seems that what you mention won't cause an issue in this scenario, am I right? – Ikram Hawramani Jun 05 '11 at 15:29
  • @Hawramani: How do you convert your code unit array to a Perl string? – Philipp Jun 05 '11 at 16:02
0

In 2018, you can use String.prototype.codePointAt() and String.fromCodePoint().

These methods work even if a character is not in the Basic Multilingual Plane (BMP).
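For example (a small sketch using the same U+1D306 character as in the answer above):

var s = String.fromCodePoint(0x1D306); // "𝌆" - works for non-BMP code points
s.length;                              // 2 - still two UTF-16 code units
s.codePointAt(0).toString(16);         // "1d306" - the full code point
s.charCodeAt(0).toString(16);          // "d834" - only the first code unit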

Simon Hi
  • 2,838
  • 1
  • 17
  • 17