
Possible Duplicate:
How many bytes in a JavaScript string?
String length in bytes in JavaScript

How can I calculate how many bits are in a String? Actually, what I need is the number of octets (8-bit bytes) in a JavaScript (V8) String. If it's impossible to know, is there any other character data structure that could be used here instead of String?

UPDATE: for UTF-8 encoding

user1109648
  • what exactly are you trying to accomplish? – Sergio Tulentsev Dec 21 '11 at 10:38
  • I would like to send it back to a browser as an HTTP response body; I need to know the content length, and I don't want to use the 'http' module. – user1109648 Dec 21 '11 at 11:01
  • Depends on the charset and the encoding. If it's ASCII, transferred as ASCII, then one byte per char. If it's Unicode transferred as UTF-8 then... you will need to do some computations! – Edgar Bonet Dec 21 '11 at 11:46
  • Can I know the charset and the encoding of a specific String? (NodeJS) – user1109648 Dec 21 '11 at 12:23
    I don't know about node.js, but in principle a JS string has no encoding per se (well, actually it's handled as UTF-16 internally, but that's probably irrelevant). You need to _choose_ an encoding when serializing the string as a stream of octets. And you need to _tell the browser_ about the encoding you have chosen, typically with an appropriate HTTP header. – Edgar Bonet Dec 21 '11 at 12:51
  • I see, so when I write a String to a socket, the default is UTF-8; how can I calculate the number of octets for this encoding? – user1109648 Dec 21 '11 at 13:13
  • There's a very short and nice solution for NodeJS. Take a look at https://stackoverflow.com/a/46321139/1852787 – Iván Pérez Jul 17 '19 at 15:43
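As the comments suggest, in Node.js the octet count of a string for a chosen encoding can be obtained directly with Buffer.byteLength (this is also what the linked answer does). A minimal sketch of the Content-Length use case described above; the raw-socket response and the writeResponse name are only an illustration, not code from the discussion:

// Byte length of a string once serialized as UTF-8 (Node.js).
var body = 'héllo wörld';
var byteLength = Buffer.byteLength(body, 'utf8'); // 13, whereas body.length === 11

// Illustrative only: hand-rolling a response on a raw net.Socket,
// declaring both the byte count and the chosen encoding.
function writeResponse(socket, body) {
    var length = Buffer.byteLength(body, 'utf8');
    socket.write(
        'HTTP/1.1 200 OK\r\n' +
        'Content-Type: text/plain; charset=utf-8\r\n' +
        'Content-Length: ' + length + '\r\n' +
        '\r\n' +
        body,
        'utf8');
}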

1 Answer


Assuming you are only using BMP characters:

/* Compute length of UTF-8 serialization of string s. */
function utf8Length(s)
{
    var l = 0;
    for (var i = 0; i < s.length; i++) {
        var c = s.charCodeAt(i);
        if (c <= 0x007f) l += 1;                      // ASCII range: 1 byte
        else if (c <= 0x07ff) l += 2;                 // up to U+07FF: 2 bytes
        else if (c >= 0xd800 && c <= 0xdfff) l += 2;  // surrogate half: 2 each, 4 per pair
        else l += 3;                                  // rest of the BMP: 3 bytes
    }
    return l;
}

If you get outside the BMP (i.e. use characters above U+FFFF), things get more complicated, as they will be seen by JavaScript as surrogate pairs that you will have to identify...
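As an aside, modern engines (not the 2011-era ones this answer targets) expose codePointAt, which makes the surrogate-pair identification explicit. A sketch under that assumption:

function utf8LengthByCodePoint(s) {
    var l = 0;
    for (var i = 0; i < s.length; i++) {
        var cp = s.codePointAt(i);
        if (cp <= 0x7f) l += 1;
        else if (cp <= 0x7ff) l += 2;
        else if (cp <= 0xffff) l += 3;  // BMP (a lone surrogate also lands here)
        else {
            l += 4;   // astral code point: 4 bytes in UTF-8
            i++;      // skip the low surrogate of the pair
        }
    }
    return l;
}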

Update: I updated the code so that it works with all of Unicode, not only BMP. However, this code now relies on a strong assumption: that the given string is correct UTF-16. It works by counting two bytes for every surrogate found in the string. The truth is that a surrogate pair is encoded as 4 bytes in UTF-8, and no surrogate should ever be found outside a pair.
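A few checks of utf8Length above (the emoji is written as an explicit surrogate pair):

utf8Length('abc');           // 3: one byte per ASCII character
utf8Length('é');             // 2: U+00E9 is in the 2-byte range
utf8Length('€');             // 3: U+20AC is in the 3-byte range
utf8Length('\ud83d\ude00');  // 4: U+1F600, one surrogate pair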

Edgar Bonet
  • Can you explain what '0x007f' is? What does it represent? – user1109648 Dec 21 '11 at 14:03
  • 0x007f is 127 in hex: it's the upper limit of ASCII and the highest Unicode codepoint encoded as a single byte in UTF-8. 0x07ff is the highest codepoint encoded as two bytes. See [Wikipedia:UTF-8](http://en.wikipedia.org/wiki/Utf8). – Edgar Bonet Dec 21 '11 at 14:28
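To summarize the thresholds from that comment, a small sketch (the helper name is made up for illustration; the 4-byte row applies once a surrogate pair has been combined into a single code point):

function utf8BytesForCodePoint(cp) {
    if (cp <= 0x7f) return 1;     // U+0000..U+007F: ASCII
    if (cp <= 0x7ff) return 2;    // U+0080..U+07FF
    if (cp <= 0xffff) return 3;   // U+0800..U+FFFF: rest of the BMP
    return 4;                     // U+10000..U+10FFFF: surrogate pairs in JS
}

utf8BytesForCodePoint('a'.charCodeAt(0)); // 1
utf8BytesForCodePoint(0x20ac);            // 3 (the euro sign)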