Unicode maps each character* to an integer “code point”. Valid code points are U+0000 through U+10FFFF, allowing for more than a million characters (although most of these aren't assigned yet).
(* It's a bit more complicated than that, because there are “combining characters” where one user-perceived character can be represented by more than one code point. And some characters have both pre-composed and decomposed representations. For example, the Spanish letter ñ can be represented either as the single code point U+00F1, or as the sequence U+006E U+0303 (n + combining tilde).)
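You can see both representations with Python's standard unicodedata module (just a quick illustration; any language with a Unicode normalization library behaves similarly):

```python
import unicodedata

precomposed = "\u00F1"        # ñ as a single code point
decomposed  = "\u006E\u0303"  # n + combining tilde

print(precomposed == decomposed)                                 # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True: NFC composes to U+00F1
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True: NFD decomposes to U+006E U+0303
```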
There are three different encoding forms (not counting offbeat ones like UTF-9 and UTF-18) that can be used to represent Unicode characters in a string.
UTF-32 is the most straightforward one: Each code point is represented by a 32-bit integer. So, for example:
- A (U+0041) = 0x00000041
- ñ (U+00F1) = 0x000000F1
- ४ (U+096A) = 0x0000096A
- 💪 (U+1F4AA) = 0x0001F4AA
While simple, UTF-32 uses a lot of memory (4 bytes for every character), and is rarely used.
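As a quick sketch, you can reproduce these values in Python with the standard utf-32-be codec (big-endian, so the bytes print in the same order as the hex above):

```python
# Each code point becomes exactly one 32-bit (4-byte) unit.
for ch in ["A", "ñ", "४", "💪"]:
    encoded = ch.encode("utf-32-be")   # big-endian, no byte-order mark
    print(f"U+{ord(ch):04X} = 0x{encoded.hex().upper()}")
```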
UTF-16 uses 16-bit code units. Characters U+0000 through U+FFFF (the “Basic Multilingual Plane”) are represented straightforwardly as a single code unit, while characters U+10000 through U+10FFFF are represented as a “surrogate pair”. Specifically, you subtract 0x10000 from the code point (resulting in a 20-bit number) and use those bits to fill out the binary sequence 110110xxxxxxxxxx 110111xxxxxxxxxx: the high 10 bits go into the first code unit and the low 10 bits into the second. For example:
- A (U+0041) = 0x0041
- ñ (U+00F1) = 0x00F1
- ४ (U+096A) = 0x096A
- 💪 (U+1F4AA) = 0xD83D 0xDCAA
In order for this system to work, the code points U+D800 through U+DFFF are permanently reserved for this UTF-16 surrogate mechanism and will never be assigned to “real” characters.
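Here's a minimal sketch of that surrogate-pair math in Python (the function name is just for illustration, and it doesn't validate its input):

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Encode one code point as a list of UTF-16 code units (no input validation)."""
    if code_point < 0x10000:
        return [code_point]               # BMP character: a single 16-bit unit
    v = code_point - 0x10000              # 20-bit value
    high = 0xD800 | (v >> 10)             # 110110xxxxxxxxxx (top 10 bits)
    low  = 0xDC00 | (v & 0x3FF)           # 110111xxxxxxxxxx (bottom 10 bits)
    return [high, low]

print(" ".join(f"0x{u:04X}" for u in utf16_code_units(0x1F4AA)))  # 0xD83D 0xDCAA
```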
It's a backwards-compatibility “hack” that allows the full 17-“plane” Unicode code space to be represented on 1990s-era platforms designed with the expectation that Unicode characters would always be 16 bits wide. Those platforms include Windows NT, Java, and JavaScript.
UTF-8 represents Unicode code points with sequences of 1-4 bytes. Specifically, each code point is encoded using the shortest of the following forms that has enough x bits to hold it:
- 0xxxxxxx
- 110xxxxx 10xxxxxx
- 1110xxxx 10xxxxxx 10xxxxxx
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
So, with the examples from earlier:
- A (U+0041) = 0x41
- ñ (U+00F1) = 0xC3 0xB1
- ४ (U+096A) = 0xE0 0xA5 0xAA
- 💪 (U+1F4AA) = 0xF0 0x9F 0x92 0xAA
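A minimal UTF-8 encoder is just a matter of picking the right pattern and filling in the bits. Here's a sketch in Python (illustrative only; it doesn't reject surrogates or out-of-range values):

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one code point as UTF-8 using the shortest form (no input validation)."""
    if code_point < 0x80:                     # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                    # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in [0x41, 0xF1, 0x96A, 0x1F4AA]:
    print(f"U+{cp:04X} = " + " ".join(f"0x{b:02X}" for b in utf8_bytes(cp)))
```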
This encoding has the property that the number of bytes in the sequence can be determined from the value of the first byte. Furthermore, leading bytes can be easily distinguished from continuation bytes:
- 0xxxxxxx = single-byte character (ASCII-compatible)
- 10xxxxxx = continuation byte of 2-, 3-, or 4-byte character
- 110xxxxx = lead byte of 2-byte character
- 1110xxxx = lead byte of 3-byte character
- 11110xxx = lead byte of 4-byte character
- 11111xxx = not used
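That property is what makes it cheap to find the length of a sequence (or resynchronize in the middle of a byte stream) by looking at a single byte. A rough sketch of that check (the function name is just for illustration):

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Length of the UTF-8 sequence starting with this byte (0 if it isn't a valid lead byte)."""
    if lead_byte < 0x80:    # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if lead_byte < 0xC0:    # 10xxxxxx: continuation byte, not a lead byte
        return 0
    if lead_byte < 0xE0:    # 110xxxxx
        return 2
    if lead_byte < 0xF0:    # 1110xxxx
        return 3
    if lead_byte < 0xF8:    # 11110xxx
        return 4
    return 0                # 11111xxx: never used in UTF-8

print(utf8_sequence_length(0xF0))  # 4
```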