143

I have a JavaScript string which is about 500K when sent from the server in UTF-8. How can I tell its size in JavaScript?

I know that JavaScript uses UCS-2, so does that mean 2 bytes per character? However, does it depend on the JavaScript implementation? Or on the page encoding, or maybe the content-type?

Paul Biggar

14 Answers

101

You can use a Blob to get the string size in bytes.

Examples:

console.info(
  new Blob(['😂']).size,                             // 4
  new Blob(['👍']).size,                             // 4
  new Blob(['😂👍']).size,                           // 8
  new Blob(['👍😂']).size,                           // 8
  new Blob(['I\'m a string']).size,                  // 12

  // from Premasagar correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size,       // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);
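If you want kilobytes rather than raw bytes (the question mentions a ~500K string), you could wrap this in a small helper; the names below are made up, just for illustration, and assume a browser or Node 18+ where Blob is global:

const utf8Bytes = (s) => new Blob([s]).size;
const kibibytes = (s) => (utf8Bytes(s) / 1024).toFixed(1) + ' KiB';

console.log(kibibytes('x'.repeat(500 * 1024))); // "500.0 KiB" (ASCII is 1 byte per character)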
CPHPython
P Roitto
86

This function will return the byte size of any UTF-8 string you pass to it.

function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}

Source

JavaScript engines are free to use UCS-2 or UTF-16 internally. Most engines that I know of use UTF-16, but whatever choice they made, it’s just an implementation detail that won’t affect the language’s characteristics.

The ECMAScript/JavaScript language itself, however, exposes characters according to UCS-2, not UTF-16.

Source
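A quick illustration of the difference between 16-bit code units and UTF-8 bytes, using the byteCount function above (a sketch; any character outside the Basic Multilingual Plane will do):

const emoji = '\u{1F600}';      // 😀, outside the Basic Multilingual Plane
console.log(emoji.length);      // 2 (two 16-bit code units)
console.log(byteCount(emoji));  // 4 (bytes when encoded as UTF-8)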

Lauri Oherd
  • 10
    Use `.split(/%(?:u[0-9A-F]{2})?[0-9A-F]{2}|./)` instead. Your snippet fails for strings that encode to "%uXXXX". – Rob W Jul 18 '14 at 13:39
  • Used for size computation on websocket frames, gives same size for a String frame as chrome dev tools. – user85155 Feb 21 '15 at 20:17
  • 3
    Used for JavaScript strings uploaded to S3; S3 displays exactly the same size [ (byteCount(s) / 1024).toFixed(2) + " KiB" ] – user85155 May 26 '15 at 07:58
72

If you're using node.js, there is a simpler solution using buffers:

function getBinarySize(string) {
    return Buffer.byteLength(string, 'utf8');
}

There is an npm lib for that: https://www.npmjs.org/package/utf8-binary-cutter (from yours truly)
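A minimal usage sketch (Node.js only; the string here is just an example):

const s = 'héllo';                 // 5 characters
console.log(s.length);             // 5 (UTF-16 code units)
console.log(getBinarySize(s));     // 6 ('é' takes 2 bytes in UTF-8)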

MrWhite
Offirmo
  • this returns `5` for `"\x80\u3042"` while Ruby's `bytesize` returns `4` (see https://apidock.com/ruby/String/bytesize) – Micael Levi Jan 17 '22 at 01:15
  • 1
    @MicaelLevi Hi, not an expert in Ruby, but it's possible that JavaScript and Ruby don't internally encode strings the same. Cf. other answers to this question: Ruby must be using UTF-8 while JavaScript seems to be using UCS-2. – Offirmo Feb 10 '22 at 22:33
  • 1
    There is no reason to use Buffer anymore. Blob and TextEncoder are built in, and they're friendlier across environments. – Endless Mar 08 '23 at 11:34
41

String values are not implementation-dependent. According to the ECMA-262 3rd Edition specification, each character represents a single 16-bit unit of UTF-16 text:

4.3.16 String Value

A string value is a member of the type String and is a finite ordered sequence of zero or more 16-bit unsigned integer values.

NOTE Although each value usually represents a single 16-bit unit of UTF-16 text, the language does not place any restrictions or requirements on the values except that they be 16-bit unsigned integers.
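A small demonstration of those 16-bit units (a sketch; any character outside the Basic Multilingual Plane behaves the same way):

const skull = '\u2620';       // ☠ fits in one 16-bit unit
const violin = '\u{1F3BB}';   // 🎻 needs a surrogate pair
console.log(skull.length, violin.length);          // 1 2
console.log(violin.charCodeAt(0).toString(16));    // "d83c" (high surrogate)
console.log(violin.charCodeAt(1).toString(16));    // "dfbb" (low surrogate)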

Christian C. Salvadó
  • 9
    My reading of that passage doesn't imply implementation independence. – Paul Biggar Feb 08 '10 at 04:59
  • 4
    UTF-16 is not guaranteed, only the fact of the strings stored as 16-bit ints. – bjornl Oct 25 '10 at 14:06
  • It's only implementation-dependent with regards to UTF-16. The 16-bit character description is universal. – Panzercrisis May 08 '15 at 14:56
  • 1
    I think internally Firefox could even use 1 byte per character for some strings.... https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/ – Michal Charemza Mar 26 '16 at 22:21
  • 1
    *UTF-16 is explicitly not allowed* the way I'm reading it. UTF-16 characters may have up to 4 bytes, but the spec says "values must be 16-bit unsigned integers". This means JavaScript string values are a subset of UTF-16, however, any UTF-16 string using 3 or 4 bytes characters would not be allowed. – whitneyland Oct 13 '17 at 16:27
  • @Lee To my knowledge, UTF-16 characters cannot be 3 bytes, only 2 or 4. – BlackVegetable Feb 22 '18 at 14:56
28

These are 3 ways I use:

  1. TextEncoder: new TextEncoder().encode("myString").length
  2. Blob: new Blob(["myString"]).size
  3. Buffer (Node.js): Buffer.byteLength("myString", 'utf8')
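A quick sanity check that all three agree (this assumes Node 18+, where Blob and TextEncoder are global; Buffer is Node-only):

const s = 'â 😀';
console.log(new TextEncoder().encode(s).length); // 7
console.log(new Blob([s]).size);                 // 7
console.log(Buffer.byteLength(s, 'utf8'));       // 7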
Saiansh Singh
Hong Ly
22

Try this combination using the unescape js function:

const byteAmount = unescape(encodeURIComponent(yourString)).length

Full encoding process example:

const s  = "1 a ф № @ ®"; // length is 11
const s2 = encodeURIComponent(s); // length is 41
const s3 = unescape(s2); // length is 15 [1-1,a-1,ф-2,№-3,@-1,®-2]
const s4 = escape(s3); // length is 39
const s5 = decodeURIComponent(s4); // length is 11
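Wrapped as a reusable helper (the helper name is made up; note that escape/unescape are deprecated, so prefer TextEncoder or Blob in new code):

function utf8ByteLength(str) {
  return unescape(encodeURIComponent(str)).length;
}

console.log(utf8ByteLength("1 a ф № @ ®")); // 15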
Saiansh Singh
Kinjeiro
  • 4
    The `unescape` JavaScript function is deprecated and should not be used to decode Uniform Resource Identifiers (URI). [Source](http://msdn.microsoft.com/en-us/library/dz4x90hk(v=vs.94).aspx) – Lauri Oherd Aug 30 '12 at 21:26
  • 1
    @LauriOherd I know the comment is old, but: In this answer, `unescape` is not used, to *decode* URIs. It is used to convert `%xx` sequences into single characters. As `encodeURIComponent` encodes a string as UTF-8, representing codeunits either as its corresponding ASCII character or as a `%xx` sequence, calling `unescape(encodeURIComponent(...))` results in a [binary string](https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary) containing the UTF-8 representation of the original string. Calling `.length` correctly gives the size in bytes of the string encoded as UTF-8. – T S Aug 09 '19 at 21:17
  • 1
    And yes, (`un`)`escape` is deprecated since 1999, but it's still available in every browser... - That said, there is good reason to deprecate it. There's basically no way to correctly use them (except for en-/decoding UTF8 in combination with `en`-/`decodeURI`(`Component`) - or at least I don't know any other useful application for (`un`)`escape`). And today there are better alternatives to encode/decode UTF8 (`TextEncoder`, etc.) – T S Aug 09 '19 at 21:23
14

Note that if you're targeting node.js you can use Buffer.from(string).length:

var str = "\u2620"; // => "☠"
str.length; // => 1 (character)
Buffer.from(str).length // => 3 (bytes)
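The byte count depends on the target encoding you pass to Buffer.from (Node-only, just to illustrate):

Buffer.from("\u2620", 'utf8').length;    // 3 bytes as UTF-8
Buffer.from("\u2620", 'utf16le').length; // 2 bytes as UTF-16LE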
maerics
10

The size of a JavaScript string is

  • Pre-ES6: 2 bytes per character
  • ES6 and later: 2 bytes per character, or 5 or more bytes per character

Pre-ES6
Always 2 bytes per character. UTF-16 is not allowed because the spec says "values must be 16-bit unsigned integers". Since UTF-16 strings can use 3- or 4-byte characters, this would violate the 2-byte requirement. Crucially, while UTF-16 cannot be fully supported, the standard does require that the two-byte characters used are valid UTF-16 characters. In other words, pre-ES6 JavaScript strings support a subset of UTF-16 characters.

ES6 and later
2 bytes per character, or 5 or more bytes per character. The additional sizes come into play because ES6 (ECMAScript 6) adds support for Unicode code point escapes. A Unicode code point escape looks like this: \u{1D306}
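For illustration, here is what the escape syntax produces in practice (a minimal sketch; the value is still a sequence of 16-bit code units):

const tetragram = '\u{1D306}';
console.log(tetragram === '\uD834\uDF06'); // true (the same surrogate pair)
console.log(tetragram.length);             // 2 (code units, not bytes)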

Practical notes

  • This doesn't relate to the internal implementation of a particular engine. For example, some engines use data structures and libraries with full UTF-16 support, but what they provide externally doesn't have to be full UTF-16 support. Also, an engine may provide external UTF-16 support, but it is not mandated to do so.

  • For ES6, practically speaking characters will never be more than 5 bytes long (2 bytes for the escape point + 3 bytes for the Unicode code point) because the latest version of Unicode only has 136,755 possible characters, which fits easily into 3 bytes. However, this is technically not limited by the standard, so in principle a single character could use, say, 4 bytes for the code point and 6 bytes total.

  • Most of the code examples here for calculating byte size don't seem to take into account ES6 Unicode code point escapes, so the results could be incorrect in some cases.

whitneyland
  • 6
    Just wondering, if size is 2 bytes per character, why does `Buffer.from('test').length` and `Buffer.byteLength('test')` equal 4 (in Node) and `new Blob(['test']).size` also equals 4? – user1063287 Jul 15 '19 at 09:42
  • Pre-ES6: UTF-16 is allowed: See [ECMA-262 3rd edition (from 1999)](https://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262,%203rd%20edition,%20December%201999.pdf): Page one says UCS2 or UTF-16 is allowed. Page 5, definition of string value: "... Although each value usually represents a single 16-bit unit of UTF-16 text, ...". On page 81 is a table, that shows how matching surrogate pairs have to be encoded as four UTF-8 bytes. – T S Aug 09 '19 at 21:49
  • "per character" - If by that you mean, per "user-perceived character" ([spec](https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries) , [simpler explanation](http://utf8everywhere.org/#characters)) it could be any number of 16bit code units. If you meant per "codepoint", it can either be [one or two 16bit code units in UTF-16](http://www.unicode.org/versions/Unicode12.1.0/ch03.pdf#G31699). (It can't be 2.5 code units (or how do you get 5 bytes?)) – T S Aug 09 '19 at 22:58
  • Whether each element in a javascript string ([16-bit unsigned integer values (“elements”)](https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-types-string-type)) is actually internally represented by two bytes is not defined in the standard. (And how could it be - As long as the interface provided to the javascript program follows the standard everything works as intended.) Mozilla for example can use [just one byte per codepoint if the string only contains latin1](https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/) – T S Aug 09 '19 at 23:11
  • Unicode code point escapes have nothing to do with string length - it's just a new way to represent strings in the source code. (`'\u{1F600}'.length===2`,`'\u{1F600}'==='\uD83D\uDE00'`,`'\u{1F600}'===''`) – T S Aug 09 '19 at 23:18
  • @user1063287 As the question wasn't very clear, this answer talks about the size of the string *in the javascript engine*, where it's stored as UTF-16. Your example encodes the string as UTF-8 and gets the size of the UTF-8 representation, which is probably what the question author had in mind. – T S Aug 09 '19 at 23:25
8

UTF-8 encodes characters using 1 to 4 bytes per code point. As CMS pointed out in the accepted answer, JavaScript will store each character internally using 16 bits (2 bytes).

If you parse each character in the string via a loop and count the number of bytes used per code point, and then multiply the total count by 2, you should have JavaScript's memory usage in bytes for that UTF-8 encoded string. Perhaps something like this:

var getStringMemorySize = function( _string ) {
    "use strict";

    // note: charCodeAt() returns individual UTF-16 code units (0..0xFFFF)
    var codePoint
        , accum = 0
    ;

    for( var stringIndex = 0, endOfString = _string.length; stringIndex < endOfString; stringIndex++ ) {
        codePoint = _string.charCodeAt( stringIndex );

        if( codePoint < 0x100 ) {
            accum += 1;
            continue;
        }

        if( codePoint < 0x10000 ) {
            accum += 2;
            continue;
        }

        if( codePoint < 0x1000000 ) {
            accum += 3;
        } else {
            accum += 4;
        }
    }

    return accum * 2;
};

Examples:

getStringMemorySize( 'I'    );     //  2
getStringMemorySize( '❤'    );     //  4
getStringMemorySize( '𠀰'   );     //  8
getStringMemorySize( 'I❤𠀰' );     // 14
Mac
4

The answer from Lauri Oherd works well for most strings seen in the wild, but will fail if the string contains lone characters in the surrogate pair range, 0xD800 to 0xDFFF. E.g.

byteCount(String.fromCharCode(55555))
// URIError: URI malformed

This longer function should handle all strings:

function bytes (str) {
  var bytes=0, len=str.length, codePoint, next, i;

  for (i=0; i < len; i++) {
    codePoint = str.charCodeAt(i);

    // Lone surrogates cannot be passed to encodeURI
    if (codePoint >= 0xD800 && codePoint < 0xE000) {
      if (codePoint < 0xDC00 && i + 1 < len) {
        next = str.charCodeAt(i + 1);

        if (next >= 0xDC00 && next < 0xE000) {
          bytes += 4;
          i++;
          continue;
        }
      }
    }

    bytes += (codePoint < 0x80 ? 1 : (codePoint < 0x800 ? 2 : 3));
  }

  return bytes;
}

E.g.

bytes(String.fromCharCode(55555))
// 3

It will correctly calculate the size for strings containing surrogate pairs:

bytes(String.fromCharCode(55555, 57000))
// 4 (not 6)

The results can be compared with Node's built-in function Buffer.byteLength:

Buffer.byteLength(String.fromCharCode(55555), 'utf8')
// 3

Buffer.byteLength(String.fromCharCode(55555, 57000), 'utf8')
// 4 (not 6)
Prem
4

A single element in a JavaScript String is considered to be a single UTF-16 code unit. That is to say, a String's characters are stored as 16-bit code units, and 16 bits equal 2 bytes (8 bits = 1 byte).

The charCodeAt() method can be used to return an integer between 0 and 65535 representing the UTF-16 code unit at the given index.

The codePointAt() method can be used to return the entire code point value of a Unicode character (i.e. its full value, as in UTF-32).

When a UTF-16 character can't be represented in a single 16-bit code unit, it will have a surrogate pair and therefore use two code units (2 × 16 bits = 4 bytes).

See Unicode encodings for different encodings and their code ranges.
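A short sketch contrasting the two methods on a character outside the BMP:

const clef = '\u{1D11E}';   // 𝄞 MUSICAL SYMBOL G CLEF
clef.length;          // 2 (two 16-bit code units -> 4 bytes)
clef.charCodeAt(0);   // 55348 (0xD834, the high surrogate only)
clef.codePointAt(0);  // 119070 (0x1D11E, the full code point)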

holmberd
  • What you say about surrogates would seem to violate the ECMA script spec. As I commented above, the spec requires two bytes per character, and allowing surrogate pairs would violate this. – whitneyland Oct 13 '17 at 16:36
  • JavaScript ES5 engines are internally free to use UCS-2 or UTF-16, but what they actually use is sort of UCS-2 with surrogates. That is because they are allowed to expose surrogate halves as separate characters, single UTF-16 unsigned integers. If you use a Unicode character in your source code that needs more than a single 16-bit code unit to be represented, a surrogate pair will be used. This behaviour does not violate the specs; see chapter 6, source text: https://www.ecma-international.org/ecma-262/5.1/ – holmberd Oct 14 '17 at 16:44
4

The Blob interface's size property returns the size of the Blob or File in bytes.

const getStringSize = (s) => new Blob([s]).size;
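Example usage (assuming a browser or Node 18+, where Blob is global):

getStringSize('a');   // 1
getStringSize('á');   // 2
getStringSize('😀');  // 4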
sooraj
1

I'm working with an embedded version of the V8 engine. I tested a single string, growing it by 1000 characters per step, in UTF-8.

The first test used a single-byte (8-bit, ANSI) character, "A" (hex: 41). The second test used a two-byte (16-bit) character, "Ω" (hex: CE A9), and the third test a three-byte (24-bit) character, "☺" (hex: E2 98 BA).

In all three cases the device reported out of memory at 888,000 characters, using ca. 26,348 KB of RAM.

Result: the characters are not stored dynamically, and not with only 16 bits each. OK, perhaps this only holds for my case (embedded 128 MB RAM device, V8 engine, C++/Qt). The character encoding has nothing to do with the size in RAM used by the JavaScript engine; e.g. encodeURI etc. is only useful for high-level data transmission and storage.

Embedded or not, the fact is that the characters are not stored in only 16 bits. Unfortunately I have no definitive answer as to what JavaScript does at the low level. By the way, I ran the same test (the first one above) with an array of the character "A", pushing 1000 items every step. (Exactly the same test, just replacing the string with an array.) The system ran out of memory (as intended) after using 10,416 KB, at an array length of 1,337,000. So the JavaScript engine is not simply restricted; it's a bit more complex.

Dominik
0

You can try this:

function getByteSize(str) {
  var b = str.match(/[^\x00-\xff]/g);
  return (str.length + (!b ? 0 : b.length));
}

It worked for me.

Lucifer
  • 1
    Surely this assumes that all character are maximum 2 bytes? If there are 3 or 4 byte characters (which are possible in UTF-8) then this function will only count them as 2-byte characters? – Adam Burley May 11 '15 at 17:38