
The question is pretty simple: how much RAM in bytes does each character in an ECMAScript/JavaScript string consume?

I am going to guess two bytes, since the standard says they are stored as 16-bit unsigned integers?

Does this mean each character is always two bytes?

Tower
  • I assume this depends on what character set context you are working in. I'm not sure whether this is (or belongs in) the language standard. Why did you delete your last (very interesting!) question instead of adding the information which browser you were testing in? – Pekka Aug 27 '11 at 20:06
  • @Pekka: I don't really know what to say about the character set. If I have a JavaScript source file which I run through the V8 engine, at what point exactly is a character set specified? I believe JS source files are interpreted using the same character set that is used for strings. – Tower Aug 27 '11 at 20:18
  • ahh. I'm referring to JS in browsers - from your last question I assumed the same. – Pekka Aug 27 '11 at 20:19
  • 1
    @rFactor, as for source encoding, ECMA-262 explicitly says (at the beginning of chapter 6) that implementations that support source encodings other than UTF-16 must behave as if the source code was transcoded to UTF-16 before being interpreted. – hmakholm left over Monica Aug 27 '11 at 20:35

1 Answer


Yes, I believe that is the case. The characters are probably stored as wide strings or UCS-2 strings. They may be UTF-16, in which case characters outside the BMP (Basic Multilingual Plane) take up two 16-bit code units each, but I believe such characters are not fully supported. Read this blog post about problems in the UTF-16 implementation of ECMAScript.
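
To make the surrogate-pair behaviour concrete, here is a small sketch (the specific character is just an example):

    // A character outside the BMP, e.g. U+1D306, is stored as a
    // surrogate pair: two 16-bit code units.
    var s = "\uD834\uDF06";       // the character written as its two surrogate halves
    s.length;                     // 2 -- length counts 16-bit code units, not characters
    s.charCodeAt(0).toString(16); // "d834" (high surrogate)
    s.charCodeAt(1).toString(16); // "df06" (low surrogate)
    s.charAt(0);                  // only half the character -- the "not fully supported" part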

Most modern languages store their strings with two-byte characters. This way you have full support for all spoken languages. It costs a little extra memory, but that's peanuts for any modern computer with multi-gigabyte RAM. Storing the string in the more compact UTF-8 would make processing more complex and slower. UTF-8 is therefore mostly used for transport only. ASCII supports only the Latin alphabet without diacritics. ANSI is still limited and needs a specified code page to make sense.
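
As a rough illustration of the size difference, here is a sketch; the `utf8ByteLength` helper is a quick approximation I am adding for this example, not a standard API:

    // Approximate UTF-8 byte count: encodeURIComponent emits one %XX escape
    // per UTF-8 byte for characters it escapes; unescaped characters are
    // single-byte ASCII.
    function utf8ByteLength(str) {
      return encodeURIComponent(str).replace(/%[0-9A-F]{2}/g, "x").length;
    }

    var ascii = "hello world";
    utf8ByteLength(ascii);        // 11 bytes as UTF-8
    ascii.length * 2;             // 22 bytes as 16-bit code units

    var chinese = "\u4F60\u597D"; // "你好"
    utf8ByteLength(chinese);      // 6 bytes as UTF-8 (3 per character)
    chinese.length * 2;           // 4 bytes as 16-bit code units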

Section 4.3.16 of ECMA-262 explicitly defines "String value" as a "primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integers". It suggests that programs use these 16-bit values as UTF-16 text, but it is legal simply to use a string to store any immutable array of unsigned shorts.
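
A quick sketch of that "immutable array of unsigned shorts" reading (the values are arbitrary):

    // A string can hold any sequence of 16-bit unsigned integers,
    // whether or not they form valid UTF-16 text.
    var raw = String.fromCharCode(0xFFFF, 0x0000, 0xD800); // NUL and a lone surrogate are legal
    raw.length;        // 3
    raw.charCodeAt(0); // 65535
    raw.charCodeAt(2); // 55296 (0xD800) -- an unpaired surrogate, meaningless as text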

Note that character size isn't the only thing that makes up the string size. I don't know about the exact implementation (and it might differ), but strings tend to have a 0x00 terminator to make them compatible with PChars (null-terminated C-style strings). And they probably have a header that contains the string size, and maybe some reference counting and even encoding information. A string with one character can easily consume 10 bytes or more (yes, that's 80 bits).
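
As a back-of-the-envelope sketch only, with a made-up header size (the real overhead is engine-specific and the figure below is only an assumption):

    // ASSUMED_HEADER_BYTES is hypothetical -- it stands in for the length field,
    // flags, refcount, terminator, allocator rounding and so on.
    var ASSUMED_HEADER_BYTES = 8;

    function estimateStringBytes(str) {
      return ASSUMED_HEADER_BYTES + 2 * str.length; // 2 bytes per 16-bit code unit
    }

    estimateStringBytes("a"); // 10 -- in the same ballpark as the figure above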

GolezTrol
  • 1
    "Strings in ECMA are always in Unicode" http://icu-project.org/docs/papers/internationalization_support_for_javascript.html#h1 Unicode here meaning the (crippled) UTF16 implementation that I mentioned. When putting these strings in a HTML document, the browser's HTML/XMLdocument will convert this string as needed. – GolezTrol Aug 27 '11 at 20:19
  • Do characters in BMP require only 1 byte in UTF-16 or UCS2? What about UTF-8? – Tower Aug 27 '11 at 20:19
  • There's a difference between bytes, code units and code points. Unicode has about 1.1 million code points, where ASCII has 128. The code unit of ANSI and UTF-8 is 1 byte, where it is 2 bytes for UTF-16. That means that every character in UTF-16 takes up 2 bytes or a multiple of 2 bytes. Characters in the BMP take up 1 code unit (2 bytes) in both UTF-16 and UCS-2. I believe UCS-2 and UTF-16 are actually the same for characters in the BMP, but UTF-16 also supports characters outside the BMP, although those take up multiple code units. (A small sketch after this comment thread illustrates the difference.) – GolezTrol Aug 27 '11 at 20:25
  • 1
    If anyone feels like a little reading, read about 'The absolute minimum every developer should know about unicode'. It is quite interesting, and it's easier to learn from that page, than from a simple synopsis I'm writing in the comments here: http://www.joelonsoftware.com/articles/Unicode.html – GolezTrol Aug 27 '11 at 20:26
  • @Pekka, the language specification explicitly says 16-bit per element in a string. I have added a reference to the answer. (In principle an implementation could choose to store strings internally as UTF-8, except that would complicate the implementation of `charAt()`). – hmakholm left over Monica Aug 27 '11 at 20:26
  • By the way, I disagree that UTF-8 is inherently slower or more complicated than UTF-16 (when handled correctly). That made sense back when Unicode was 16 bits only, but since the introduction of surrogates, _both_ UTF-8 and UTF-16 require programs to be variable-length aware. If anything, UTF-16 is worse because it lets programmers get away with doing it wrong, whereas the same error for UTF-8 would show up as soon as you try to use the code in a non-English setting. Today, using UTF-16 over UTF-8 makes sense only when you have a historical commitment to support arbitrary 16-bit sequences. – hmakholm left over Monica Aug 27 '11 at 20:56
  • Indeed, and that's where UCS-2 comes into play. That's the version of UTF-16 that only supports single-code-unit characters. In UCS-2, you have full access to all characters in the BMP, but without the hassle of multi-code-unit characters. Lots of implementations claiming to be UTF-16 are actually just UCS-2. But still: UTF-8 is hard as soon as you try to write a single line in French, while just the single-code-unit UTF-16 characters allow the use of about every character in every spoken language, including Arabic and Simplified Chinese. – GolezTrol Aug 27 '11 at 21:04
  • @rFactor. If in doubt, just read. I provided a link above, but it's on wikipedia as well: "The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. Both UTF-16 and UCS-2 encode valid code points in this range as single 16-bit code units that are numerically equal to the corresponding code points. The code points in the BMP are the only code points that can be represented in UCS-2." http://en.wikipedia.org/wiki/UTF-16/UCS-2 – GolezTrol Aug 27 '11 at 21:07
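
To illustrate the code unit vs. code point distinction from the comments above, a small sketch (the sample characters are arbitrary):

    // U+00E9 (é) is inside the BMP: one 16-bit code unit, two UTF-8 bytes.
    "\u00E9".length;               // 1
    encodeURIComponent("\u00E9");  // "%C3%A9" -- 2 UTF-8 bytes

    // U+1D306 is outside the BMP: two 16-bit code units (a surrogate pair), four UTF-8 bytes.
    "\uD834\uDF06".length;              // 2
    encodeURIComponent("\uD834\uDF06"); // "%F0%9D%8C%86" -- 4 UTF-8 bytes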