
How many bytes are required to store one character in:

  • Microsoft's implementation of the .NET framework, version 4
  • JavaScript, as implemented by Microsoft Internet Explorer 8?
icktoofay
Kuttan Sujith

2 Answers


Both .NET and JavaScript use UTF-16:

Represents each Unicode code point as a sequence of one or two 16-bit integers. Most common Unicode characters require only one UTF-16 code point, although Unicode supplementary characters (U+10000 and greater) require two UTF-16 surrogate code points. Both little-endian and big-endian byte orders are supported.

So a single character can take 16 bits (2 bytes) or 32 bits (4 bytes).
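A quick sketch in modern JavaScript illustrating both cases (the `\u{...}` escape is ES2015+ syntax, so IE8 from the question would not support it; treat this as illustrative):

```javascript
// String.length in JavaScript counts 16-bit UTF-16 code units,
// not characters, so the storage size is length * 2 bytes.

const bmp = "A";            // U+0041, inside the Basic Multilingual Plane
const astral = "\u{1F600}"; // U+1F600, a supplementary character (emoji)

console.log(bmp.length);    // 1 code unit  -> 2 bytes in UTF-16
console.log(astral.length); // 2 code units -> 4 bytes in UTF-16
```

The same holds for .NET: `System.String.Length` also counts UTF-16 code units (`System.Char` values), not Unicode code points.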

Damith
  • Neither is 16-bit. Both use UTF-16 (with some peculiarities around JavaScript), which is a variable-length encoding, the name notwithstanding. – Michael Petrotta May 30 '12 at 05:02
  • UTF-16 means that it uses 16-bit *code units* to represent the 21-bit *code points* of Unicode. So you need one or two such code units for a single code point, depending on the code point. – Joey May 30 '12 at 05:08

Both .NET and JavaScript use UTF-16. UTF-16 is a so-called variable-length encoding which uses 16-bit code units to represent Unicode code points (which are 21 bits in length). Historically it came from UCS-2 when Unicode was still a 16-bit code (which was deemed insufficient later, thus the expansion to 21 bits).

Since UTF-16 uses 16-bit code units, the encoding itself is a 16-bit code, but to say how many bytes a character needs, you'll have to look a bit more closely at what you actually mean by "character":

  1. Character in the Unicode sense means Unicode code point which is probably your intended meaning. Here are two cases:

    1. A code point in the range U+0000 to U+FFFF takes up two bytes, because it can be represented in a single UTF-16 code unit (here code unit and code point are identical).
    2. A code point in the range U+10000 to U+10FFFF takes up four bytes because it has to be represented using two UTF-16 code units.
  2. Character in the usual meaning often refers to graphemes, actually, which would be what we perceive as a single character. Those can have arbitrarily many diacritics, or may be ligatures that are formed out of multiple code points by the rendering engine. Long story short: in this case those can be arbitrarily long, since they can consist of several code points.
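The distinctions above can be sketched in modern JavaScript (ES2015+; the string iterator walks code points, while `.length` counts code units, so the examples below are illustrative of all three notions of "character"):

```javascript
// Code units vs. code points vs. graphemes.

// One perceived character (grapheme), built from two code points:
// U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const grapheme = "e\u0301";
console.log(grapheme.length);      // 2 code units -> 4 bytes in UTF-16
console.log([...grapheme].length); // 2 code points (spread iterates code points)

// One supplementary code point, U+1D11E MUSICAL SYMBOL G CLEF:
const clef = "\u{1D11E}";
console.log(clef.length);          // 2 code units (a surrogate pair) -> 4 bytes
console.log([...clef].length);     // 1 code point
console.log(clef.codePointAt(0).toString(16)); // "1d11e"
```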

Joey