
I'm getting too confused. Why do code points from U+D800 to U+DBFF encode as a single (2-byte) String element when using the ECMAScript 6 native Unicode helpers?

I'm not asking how JavaScript/ECMAScript encodes Strings natively; I'm asking about the extra functionality for encoding UTF-16 that makes use of UCS-2.

var str1 = '\u{D800}';
var str2 = String.fromCodePoint(0xD800);

console.log(
  str1.length, str1.charCodeAt(0), str1.charCodeAt(1)
); // 1 55296 NaN

console.log(
  str2.length, str2.charCodeAt(0), str2.charCodeAt(1)
); // 1 55296 NaN

Re-TL;DR: I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a string of length 2, since my browser's ES6 implementation incorporates UCS-2 encoding in strings, which uses 2 bytes for each character code?

Both of these approaches return a one-element String for the U+D800 code point (char code 55296, i.e. 0xD800). But for code points bigger than U+FFFF each one returns a two-element String: the lead and the trail. The lead is a number between 0xD800 and 0xDBFF; the trail I'm not sure about, I only know it helps determine the resulting code point. To me the return value doesn't make sense: it represents a lead without a trail. Am I misunderstanding something?
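For instance, here's the two-element behaviour I mean (0x10000 is just the first code point above 0xFFFF, picked for illustration):

var str3 = String.fromCodePoint(0x10000); // a code point above U+FFFF
console.log(str3.length);        // 2 - two UTF-16 code units
console.log(str3.charCodeAt(0)); // 55296 (0xD800, the lead)
console.log(str3.charCodeAt(1)); // 56320 (0xDC00, the trail)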

  • Use `codePointAt` instead of `charCodeAt`. The latter will only return information for the first code unit of a surrogate pair. – 4castle Feb 11 '17 at 21:01
  • @4castle I used charCodeAt() as an example of what's happening; as you can see, the result string of both approaches contains only one code unit. –  Feb 11 '17 at 21:02
  • I'm not sure I totally understand your question. It produces a one-length string because that is what you asked it to do. Are you asking to understand surrogate pairs? – loganfsmyth Feb 11 '17 at 21:26
  • @loganfsmyth Exactly, I need to understand why that's the result. –  Feb 11 '17 at 21:28
  • Code points above 0xFFFF are represented by two numbers. What's your question exactly? – Michał Miszczyszyn Feb 11 '17 at 21:49
  • Possible duplicate of http://stackoverflow.com/q/6885879/5217142 - Javascript does not implement strings as Unicode characters. Rather it records a sequence of 16 bit values used to encode Unicode characters. Unfortunately this results in a string length of 2 for a single Unicode character that requires a surrogate pair in UTF-16 encoding. – traktor Feb 11 '17 at 21:49
  • @Traktor53 Yes, this is correct, but that's not what I'm asking. Did you read the question carefully? *"Why does code points from U+D800 to U+DBFF encode as an unique (2 bytes) String element"* (2 bytes = 16 bits). I know that JavaScript uses UCS-2. –  Feb 11 '17 at 22:07
  • @handoncloud Your question is worded weirdly. What exactly do you mean by *unique* in this context? – melpomene Feb 11 '17 at 22:09
  • @melpomene Okay, if you insist: unique: one, a. The code point U+D800 was encoded with only one UCS-2 code unit. –  Feb 11 '17 at 22:17
  • @handoncloud ... OK, that's not what "unique" means. You want "single". – melpomene Feb 11 '17 at 22:18

2 Answers


I think your confusion is about how Unicode encodings work in general, so let me try to explain.

Unicode itself just specifies a list of characters, each identified by a number called its "code point". It doesn't tell you how to convert those to bits; it just gives every character a number between 0 and 1114111 (0x10FFFF in hexadecimal). There are several different ways these numbers from U+0000 to U+10FFFF can be represented as bits.

In an earlier version, it was expected that a range of 0 to 65535 (0xFFFF) would be enough. This can be naturally represented in 16 bits, using the same convention as an unsigned integer. This was the original way of storing Unicode, and is now known as UCS-2. To store a single code point, you reserve 16 bits of memory.

Later, it was decided that this range was not large enough; this meant that there were code points higher than 65535, which you can't represent in a 16-bit piece of memory. UTF-16 was invented as a clever way of storing these higher code points. It works by saying "if you look at a 16-bit piece of memory, and it's a number between 0xD800 and 0xDBFF (a "lead surrogate"), then you need to look at the next 16 bits of memory as well". Any piece of code which is performing this extra check is processing its data as UTF-16, and not UCS-2.
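As a rough sketch of that arithmetic (the function names here are just for illustration, not any built-in API):

// Illustrative sketch of the UTF-16 surrogate arithmetic
function toSurrogatePair(codePoint) {     // assumes codePoint > 0xFFFF
  var offset = codePoint - 0x10000;       // 0 .. 0xFFFFF (20 bits)
  var lead   = 0xD800 + (offset >> 10);   // top 10 bits    -> 0xD800..0xDBFF
  var trail  = 0xDC00 + (offset & 0x3FF); // bottom 10 bits -> 0xDC00..0xDFFF
  return [lead, trail];
}

function fromSurrogatePair(lead, trail) {
  return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
}

console.log(toSurrogatePair(0x1F4A9));        // [55357, 56489] (0x1F4A9 = 128169)
console.log(fromSurrogatePair(55357, 56489)); // 128169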

It's important to understand that the memory itself doesn't "know" which encoding it's in; the difference between UCS-2 and UTF-16 is how you interpret that memory. When you write a piece of software, you have to choose which interpretation you're going to use.

Now, onto Javascript...

Javascript handles input and output of strings by interpreting its internal representation as UTF-16. That's great, it means that you can type in and display the famous 💩 character (U+1F4A9), which can't be stored in one 16-bit piece of memory.

The problem is that most of the built-in string functions actually handle the data as UCS-2 - that is, they look at 16 bits at a time, and don't care if what they see is a special "surrogate". The function you used, charCodeAt(), is an example of this: it reads 16 bits out of memory, and gives them to you as a number between 0 and 65535. If you feed it 💩, it will just give you back the first 16 bits; ask it for the next "character" after, and it will give you the second 16 bits (which will be a "trail surrogate", between 0xDC00 and 0xDFFF).

In ECMAScript 6 (2015), a new function was added: codePointAt(). Instead of just looking at 16 bits and giving them to you, this function checks if they represent one of the UTF-16 surrogate code units, and if so, looks for the "other half" - so it gives you a number between 0 and 1114111. If you feed it 💩, it will correctly give you 128169.

var poop = '💩'; // U+1F4A9 PILE OF POO
console.log('Treat it as UCS-2, two 16-bit numbers: ' + poop.charCodeAt(0) + ' and ' + poop.charCodeAt(1));
console.log('Treat it as UTF-16, one value cleverly encoded in 32 bits: ' + poop.codePointAt(0));
// The surrogates are 55357 and 56489, which encode 128169 as follows:
// 0x010000 + ((55357 - 0xD800) << 10) + (56489 - 0xDC00) = 128169

Your edited question now asks this:

I want to know why the above approaches return a string of length 1. Shouldn't U+D800 generate a 2 length string?

The hexadecimal value D800 is 55296 in decimal, which is less than 65536, so given everything I've said above, this fits fine in 16 bits of memory. So if we ask charCodeAt to read 16 bits of memory, and it finds that number there, it's not going to have a problem.

Similarly, the .length property measures how many sets of 16 bits there are in the string. Since this string is stored in 16 bits of memory, there is no reason to expect any length other than 1.
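A quick comparison (just a few illustrative values):

console.log('\u{D800}'.length);  // 1 - fits in a single 16-bit unit
console.log('\u{FFFF}'.length);  // 1 - still a single 16-bit unit
console.log('\u{10000}'.length); // 2 - needs a surrogate pair
console.log('\u{1F4A9}'.length); // 2 - needs a surrogate pair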

The only unusual thing about this number is that in Unicode, that value is reserved - there isn't, and never will be, a character U+D800. That's because it's one of the magic numbers that tells a UTF-16 algorithm "this is only half a character". So a possible behaviour would be for any attempt to create this string to simply be an error - like opening a pair of brackets that you never close, it's unbalanced, incomplete.

The only way you could end up with a string of length 2 is if the engine somehow guessed what the second half should be; but how would it know? There are 1024 possibilities, from 0xDC00 to 0xDFFF, which could be plugged into the formula I show above. So it doesn't guess, and since it doesn't error, the string you get is 16 bits long.
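To illustrate why guessing is impossible: pairing the lead 0xD800 with different trails decodes to different code points, so no single choice could be "the" right one.

// Same decoding formula as above, with two different trails
var lead = 0xD800;
console.log(0x10000 + ((lead - 0xD800) << 10) + (0xDC00 - 0xDC00)); // 65536 (U+10000)
console.log(0x10000 + ((lead - 0xD800) << 10) + (0xDFFF - 0xDC00)); // 66559 (U+103FF)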

Of course, you can supply the matching halves, and codePointAt will interpret them for you.

// Set up two 16-bit pieces of memory
var high=String.fromCharCode(55357), low=String.fromCharCode(56489);
// Note: String.fromCodePoint will give the same answer
// Glue them together (this + is string concatenation, not number addition)
var poop = high + low;
// Read out the memory as UTF-16
console.log(poop);                // 💩
console.log(poop.codePointAt(0)); // 128169
IMSoP
  • I like the answer, but it still doesn't answer my question. I already know the String#codePointAt() method, as well as String.fromCodePoint(), but they're unrelated to the actual problem. Anyway, it's a cool answer. The problem is how ES6 encodes my desired code point using UCS-2. –  Feb 16 '17 at 23:13
  • Yes, I know that; please read the question again and you'll understand the actual problem. The problem is that the browser's *`String.fromCodePoint()`* or *`\u{...}`* encodes a code point without a surrogate pair. See the question, please. –  Feb 16 '17 at 23:22
  • I've added a TL;DR description –  Feb 16 '17 at 23:24
  • Though my question was specific already. I'm not sure why everyone is against my question –  Feb 16 '17 at 23:29
  • @handoncloud We're not against it, we're just struggling to understand what answer you were hoping for. I've added an extra section to my answer; does this clarify things? – IMSoP Feb 16 '17 at 23:35
  • I'm not sure if you meant that Unicode doesn't have a U+D800 code point (or U+D801, ..., U+DBFF, etc.), because I expected to see U+D800 encoded into a lead and trail too, exactly as asked in the question. From what you said, I understood that the code units used for the lead of surrogate pairs aren't allowed (`0xD800`, etc.). –  Feb 17 '17 at 09:27
  • I couldn't look at the edit properly yesterday since my mother switches off the network at around 11:00 PM (or 22:00 here). –  Feb 17 '17 at 09:28
  • Yes, all the Unicode code points from U+D800 through to U+DFFF are reserved, and will never be allocated a meaning. There is no lead and trail combination that would encode U+D800; any number encoded by surrogates will be at least `0x010000`. You can see that in the formula for decoding them: it adds a fixed `0x10000`, because the aim is to encode values that don't fit in 16 bits. UTF-16 is basically an ugly compromise between a 16-bit encoding and a 32-bit encoding, which can represent `2^20 + 2^16 - 2048` possible code points; it's a clever hack that leaves us with awkward situations like this. – IMSoP Feb 17 '17 at 11:34
  • @handoncloud You could think of the answer to "What code point does 0xD800 represent in UCS-2?" as like "What is the square root of -1?" - there is no meaningful answer. However, if you say "put these 16 bits into memory" (`String.fromCharCode`) and then "tell me what 16 bits are at that point in memory" (`.charCodeAt`) you'll get back what you put in. As I say, `String.fromCodePoint` could have been defined to raise an error if you gave it a reserved code point, but evidently it wasn't, so for a number less than 65536, it just acts the same as `fromCharCode`, and puts it straight into memory. – IMSoP Feb 17 '17 at 11:46
  • That's what I really wanted to know. Then I'll handle these errors in my programming language, of course, heheh. I thought `0xD800` to `0xDFFF` being reserved would cause a big waste of character slots, but anyway `0xDFFF - 0xD800 = 2047`. –  Feb 17 '17 at 11:50
  • Technically, yes, UTF-16 *is* a waste of character slots, but it was already too late to do it better. Representing Unicode efficiently in memory is tricky; some systems prefer to represent in UTF-8, which is a little trickier to work with, but more efficient for common Western characters. And once you've dealt with that, you've got a whole load more pain to use Unicode properly: is "e, combining accent acute" a length of 2 code points, or 1 grapheme? what happens when you reverse a string with that in? Even "to upper case" is tricky in Unicode, because Turkish has two types of upper case "i"... – IMSoP Feb 17 '17 at 12:11
  • You did mean that Unicode reserves the U+D800 to U+DFFF code points, but is it only UTF-16 that is reserving them, then? Hm, maybe I can use UTF-8 then, so I can allow the use of U+D800, right? Or did you mean that UTF-16 is just inefficient since it uses 2 to 4 bytes to represent a character? Because Latin, for example, is common and it's a waste to use 2 bytes for a Latin letter, right? –  Feb 17 '17 at 12:18
  • The fact that UTF-16 exists has caused Unicode as a whole to reserve those code points. It's now impossible for Unicode to give meaning to any of those code points, because it would break UTF-16. UTF-16 and UTF-8 are also inefficient in a different way: they take up more bits of memory than they theoretically could. And yes, for a Latin string, UTF-8 will use fewer bits than UTF-16; for some other strings, it will use *more* bits. Theoretically, you could invent a memory layout where each Unicode code point took up 21 bits; it would be horrible to work with though! – IMSoP Feb 17 '17 at 12:22
  • In this case I gotta incorporate UTF-16 in my programming language! –  Feb 17 '17 at 12:29
  • It'll certainly be easier –  Feb 17 '17 at 12:29
  • Man, it's thanks to you that I'm continuing my SonicScript parser :D –  Feb 17 '17 at 17:29

Well, it does this because the specification says it has to: the algorithm for `String.fromCodePoint` defers to the spec's UTF-16 encoding operation, and together they say that if an argument is < 0 or > 0x10FFFF, a RangeError is thrown, but otherwise any code point <= 65535 is incorporated into the result string as-is.
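Both halves of that rule are easy to observe (a small sketch; the exact error message varies between engines):

try {
  String.fromCodePoint(0x110000);         // greater than 0x10FFFF
} catch (e) {
  console.log(e instanceof RangeError);   // true
}
console.log(String.fromCodePoint(0xD800).length);        // 1 - taken as-is
console.log(String.fromCodePoint(0xD800).charCodeAt(0)); // 55296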

As for why things are specified this way, I don't know. It seems like JavaScript doesn't really support Unicode, only UCS-2.

Unicode.org has the following to say on the matter:

  • http://www.unicode.org/faq/utf_bom.html#utf16-2

    Q: What are surrogates?

    A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.

  • http://www.unicode.org/faq/utf_bom.html#utf16-7

    Q: Are there any 16-bit values that are invalid?

    A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.

Therefore the result of String.fromCodePoint is not always valid UTF-16 because it can emit unpaired surrogates.
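If you need to detect such strings, one possible sketch (not the only way; the regex below is just an illustration) is to look for lone surrogates:

// A lead surrogate not followed by a trail, or a trail not preceded by a lead
var loneSurrogate = /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/;

console.log(loneSurrogate.test(String.fromCodePoint(0xD800)));  // true  - not valid UTF-16
console.log(loneSurrogate.test(String.fromCodePoint(0x1F4A9))); // false - a proper pair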

melpomene
  • So is what I asked about related to a bug, or is it intentional? –  Feb 11 '17 at 22:19
  • *"It seems like JavaScript doesn't really support Unicode, only UCS-2."*, I can't agree. JavaScript can support UTF-16, very well using UCS-2 (lead and trail => 2 code units in UCS-2), and UTF-8 can be obtained from `Uint8Array`. –  Feb 14 '17 at 14:23
  • @handoncloud That's like saying C "supports" Unicode because it uses 1-byte chars, which can be used to form valid UTF-8 sequences. – melpomene Feb 15 '17 at 21:25
  • UTF-16 isn't UTF-8. And as I said, again............ UCS-2 is perfect to form UTF-16, since it uses 2-byte chars. –  Feb 15 '17 at 22:28
  • @handoncloud UTF-8 isn't UTF-16. And as I said, again: Octets are perfect to form UTF-8, since it uses 1-byte chars. – melpomene Feb 16 '17 at 08:24
  • I know, and I know much better than you. –  Feb 16 '17 at 09:17
  • @handoncloud You can't "form UTF-16 using UCS-2", they are different encodings; that is, different ways of interpreting a series of bits as Unicode code points. They just happen to agree on the interpretation of most sequences of 16 bits. If you interpret a 32 bit sequence (containing a high and low surrogate) as a single code point, you are by definition using UTF-16, not UCS-2. – IMSoP Feb 16 '17 at 21:36
  • @IMSoP They are different, but yes, UCS-2 (it uses two bytes for each char code) can represent UTF-16, except that in this representation the target code point might take up 2 code units in the string, the first code unit being the lead and the second being the trail. So is the problem that the browser implemented it wrong, or the ECMAScript 6 VM itself? From what people have said here, that's how I understand it. –  Feb 16 '17 at 21:41
  • @handoncloud If it's "representing UTF-16", it's not UCS-2 - it's UTF-16! UCS-2 is an encoding that looks at 16 bits of data, decides what code point it is, *and stops there*; it can only represent 65536 code points, ever. That is literally the difference between UCS-2 (fixed width) and UTF-16 (variable width). (In fact, since some of those 65536 code points are reserved for special use in UTF-16, UCS-2 can represent slightly fewer than 65536 *valid* code points.) – IMSoP Feb 16 '17 at 21:50
  • @IMSoP I agree with you, but you're wrong on one point, *"it's representing UTF-16"*: it's representing UCS-2 in the ECMAScript 6 implemented in my current browser. Why? Your browser is probably similar to mine; execute this in the console, for example: `'\u{10FFFF}'.length`, and you'll see `2`, the length in UCS-2 code units, or two character codes. A UCS-2 code unit can be up to `0xFFFF` (65535 different characters). –  Feb 16 '17 at 21:55