
How many bytes are required to store one character in:

  • Microsoft's implementation of the .NET framework, version 4
  • JavaScript, as implemented by Microsoft Internet Explorer 8?
icktoofay
Kuttan Sujith

2 Answers


Both .NET and JavaScript use UTF-16:

Represents each Unicode code point as a sequence of one or two 16-bit integers. Most common Unicode characters require only one UTF-16 code point, although Unicode supplementary characters (U+10000 and greater) require two UTF-16 surrogate code points. Both little-endian and big-endian byte orders are supported.

So a single character can take 16 bits (2 bytes) or 32 bits (4 bytes).
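A quick sketch in modern JavaScript illustrating both cases (the `\u{...}` escape is ES2015+ syntax, so IE8 from the question would not support it; treat this as illustrative):

```javascript
// String.length in JavaScript counts 16-bit UTF-16 code units,
// not characters, so the storage size is length * 2 bytes.

const bmp = "A";            // U+0041, inside the Basic Multilingual Plane
const astral = "\u{1F600}"; // U+1F600, a supplementary character (emoji)

console.log(bmp.length);    // 1 code unit  -> 2 bytes in UTF-16
console.log(astral.length); // 2 code units -> 4 bytes in UTF-16
```

The same holds for .NET: `System.String.Length` also counts UTF-16 code units (`System.Char` values), not Unicode code points.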

Damith
  • Neither is 16-bit. Both use UTF-16 (with some peculiarities around JavaScript), which is a variable-length encoding, the name notwithstanding. – Michael Petrotta May 30 '12 at 05:02
  • UTF-16 means that it uses 16-bit *code units* to represent the 21-bit *code points* of Unicode. So you need one or two such code units for a single code point, depending on the code point. – Joey May 30 '12 at 05:08

Both .NET and JavaScript use UTF-16. UTF-16 is a so-called variable-length encoding which uses 16-bit code units to represent Unicode code points (which are 21 bits in length). Historically it came from UCS-2 when Unicode was still a 16-bit code (which was deemed insufficient later, thus the expansion to 21 bits).

Since UTF-16 uses 16-bit code units, the encoding itself is a 16-bit code, but to say how many bytes a character needs, you'll have to look a bit more closely at what you actually mean by "character":

  1. Character in the Unicode sense means Unicode code point which is probably your intended meaning. Here are two cases:

    1. A code point in the range U+0000 to U+FFFF takes up two bytes, because it can be represented in a single UTF-16 code unit (here code unit and code point are identical).
    2. A code point in the range U+10000 to U+10FFFF takes up four bytes because it has to be represented using two UTF-16 code units.
  2. Character in the usual meaning often refers to graphemes, actually, which would be what we perceive as a single character. Those can have arbitrarily many diacritics, or may be ligatures that are formed out of multiple code points by the rendering engine. Long story short: in this case those can be arbitrarily long, since they can consist of several code points.
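The distinctions above can be sketched in modern JavaScript (ES2015+; the string iterator walks code points, while `.length` counts code units, so the examples below are illustrative of all three notions of "character"):

```javascript
// Code units vs. code points vs. graphemes.

// One perceived character (grapheme), built from two code points:
// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const grapheme = "e\u0301";
console.log(grapheme.length);      // 2 code units -> 4 bytes in UTF-16
console.log([...grapheme].length); // 2 code points (spread iterates code points)

// One supplementary code point, U+1D11E MUSICAL SYMBOL G CLEF:
const clef = "\u{1D11E}";
console.log(clef.length);          // 2 code units (a surrogate pair) -> 4 bytes
console.log([...clef].length);     // 1 code point
console.log(clef.codePointAt(0).toString(16)); // "1d11e"
```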

Joey