181

In my JavaScript code I need to compose a message to the server in this format:

<size in bytes>CRLF
<data>CRLF

Example:

3
foo

The data may contain unicode characters. I need to send them as UTF-8.

I'm looking for the most cross-browser way to calculate the length of the string in bytes in JavaScript.

I've tried this to compose my payload:

return unescape(encodeURIComponent(str)).length + "\n" + str + "\n"

But it does not give me accurate results in older browsers (or maybe the strings in those browsers are in UTF-16?).

Any clues?

Update:

Example: length in bytes of the string ЭЭХ! Naïve? in UTF-8 is 15 bytes, but some browsers report 23 bytes instead.

Alexander Gladysh
  • 1
    Possible duplicate? http://stackoverflow.com/questions/2219526/how-many-bytes-in-a-javascript-string – Eli Apr 01 '11 at 16:03
  • @Eli: none of the answers in the question you've linked to work for me. – Alexander Gladysh Apr 01 '11 at 16:14
  • When you talk about "ЭЭХ! Naïve?" have you put it into a particular normal form? http://unicode.org/reports/tr15/ – Mike Samuel Apr 01 '11 at 16:20
  • @Mike: I typed it in the random text editor (in UTF-8 mode) and saved it. Just as any user of my library would do. However, it seems that I figured out what was wrong — see my answer. – Alexander Gladysh Apr 01 '11 at 16:29

17 Answers

238

Years have passed, and nowadays you can do it natively:

(new TextEncoder().encode('foo')).length

Note that it's not supported by IE (you may use a polyfill for that).
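For the framing described in the question, a minimal sketch could look like the following. The name `frameMessage` is hypothetical, and the sketch assumes the resulting string is later sent over the wire as UTF-8, since otherwise the byte count in the header will not match the body.

```javascript
// Hypothetical helper: frames a string as "<size in bytes>CRLF<data>CRLF".
function frameMessage(str) {
  // TextEncoder always encodes to UTF-8, so the array length is the byte count.
  const size = new TextEncoder().encode(str).length;
  return size + "\r\n" + str + "\r\n";
}

console.log(JSON.stringify(frameMessage("foo"))); // "3\r\nfoo\r\n"
```

The size is computed from the data only; the trailing CRLF is not included in the count, matching the example in the question.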

MDN documentation

Standard specifications

Riccardo Galli
  • Notice that according to the [MDN documentation](https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder), the TextEncoder is not supported yet by Safari (WebKit). – Maor Oct 15 '17 at 14:08
  • `TextEncoder` supports only *utf-8* since Chrome 53. – Jehong Ahn Apr 09 '18 at 06:23
  • 3
    If you only need the length, it might be overkill to allocate a new string, do the actual conversion, take the length, and then discard the string. See my answer above for a function that just computes the length in an efficient manner. – lovasoa Sep 09 '19 at 09:50
107

There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)


The rest of this answer is kept for historical reference, and for environments where the TextEncoder API is still unavailable.

If you know the character encoding, you can calculate it yourself though.

encodeURIComponent assumes UTF-8 as the character encoding, so if you need that encoding, you can do:

function lengthInUtf8Bytes(str) {
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}

This should work because of the way UTF-8 encodes multi-byte sequences. The first encoded byte always starts with either a high bit of zero for a single byte sequence, or a byte whose first hex digit is C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.

The table in Wikipedia makes it clearer:

Bits        Last code point Byte 1          Byte 2          Byte 3
  7         U+007F          0xxxxxxx
 11         U+07FF          110xxxxx        10xxxxxx
 16         U+FFFF          1110xxxx        10xxxxxx        10xxxxxx
...
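As the comments on this answer point out, `lengthInUtf8Bytes` above returns 5 rather than 4 for a character outside the BMP, because `str.length` counts its surrogate pair as two units. One possible correction is sketched below. The function name is made up for illustration, and the sketch assumes well-formed input, since `encodeURIComponent` throws a `URIError` on an unpaired surrogate.

```javascript
// Hypothetical variant of lengthInUtf8Bytes that also handles surrogate pairs.
function lengthInUtf8BytesAstral(str) {
  // Continuation bytes (10xxxxxx) show up as %8x, %9x, %Ax or %Bx.
  var continuation = encodeURIComponent(str).match(/%[89ABab]/g);
  // Each surrogate pair is one code point but contributes 2 to str.length.
  var pairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g);
  return str.length
    + (continuation ? continuation.length : 0)
    - (pairs ? pairs.length : 0);
}

console.log(lengthInUtf8BytesAstral("\uD834\uDF06")); // 4
```

A 4-byte sequence has one lead byte and three continuation bytes, so the pair contributes 2 + 3 - 1 = 4.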

If instead you need to understand the page encoding, you can use this trick:

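function lengthInPageEncoding(s) {
  // Let the browser percent-encode the string as part of a URL fragment.
  var a = document.createElement('A');
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Each %XX escape is three characters but stands for one byte.
  // The "i" flag matters: browsers emit uppercase hex digits.
  var m = sEncoded.match(/%[0-9a-f]{2}/gi);
  return sEncoded.length - (m ? m.length * 2 : 0);
}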
Mike Samuel
  • Well, how would I know the character encoding of the data? I need to encode whatever string user (programmer) supplied to my JS library. – Alexander Gladysh Apr 01 '11 at 16:15
  • @Alexander, when you're sending the message to the server, are you specifying the content-encoding of the message body via an HTTP header? – Mike Samuel Apr 01 '11 at 16:20
  • @Mike: Well, I'm coding a JS library. I do not know how the user would specify the content-encoding. However, I can put this in the requirements. What is the best way to spell this out in the readme? – Alexander Gladysh Apr 01 '11 at 16:27
  • @Alexander, unless you can decouple the encoding they use from your code, you have to either require that they specify the encoding you assume, or you require that they send a message with the page encoding. Will edit my post to make it clear how to test the page encoding. – Mike Samuel Apr 01 '11 at 16:30
  • @Mike: so, you're saying that `unescape(encodeURIComponent(str)).length` would work *only* if `str` is in UTF-8? – Alexander Gladysh Apr 01 '11 at 16:37
  • @Mike: I see your update, but I'm confused by the function name. `lengthInPageEncoding` suggests that it is a length in characters. Is that correct? I need the length in *bytes*. (Sorry for a stupid question.) – Alexander Gladysh Apr 01 '11 at 16:39
  • `unescape(encodeURIComponent(str)).length` won't do anything useful since `unescape` does different things on different platforms. `decodeURIComponent(encodeURIComponent(str)).length` will only give you `str.length`. See http://xkr.us/articles/javascript/encode-compare/ – Mike Samuel Apr 01 '11 at 16:45
  • @Mike: I see. So, you suggest that the proper way to get string length in bytes in my case is to use your `lengthInPageEncoding` function, is this correct? – Alexander Gladysh Apr 01 '11 at 16:49
  • @Alexander, if you can possibly get away without worrying about the length in bytes, I would do that. But if you really need to, something like that is probably your best bet. It probably won't work for strange encodings like UTF-7 since + is a special character in URIs and requires a multi-byte encoding in UTF-7. – Mike Samuel Apr 01 '11 at 16:52
  • @Mike: I can't get away without it, since server is so stupid that it does not know anything about Unicode — it treats all strings as binary blobs (and it does not need to know more to work). I do not worry about UTF-7. Actually I'm fine with enforcing UTF-8 (but would like to support UTF-16 and CP1251 and such as well). – Alexander Gladysh Apr 01 '11 at 16:56
  • @Mike: about `unescape`: the strange thing is that all browsers on http://browsershots.org/ display the correct string size — for the UTF-8 string that I tested at least (see url in my answer). Is this a fluke? – Alexander Gladysh Apr 01 '11 at 17:54
  • I'm accepting this answer, but in the end I decided to extend the protocol to support UTF-8 strings natively. Apparently it is not that scary: http://stackoverflow.com/questions/5517205/how-to-read-utf-8-string-given-its-length-in-characters-in-plain-c89 – Alexander Gladysh Apr 01 '11 at 18:57
  • 1
    @Alexander, cool. If you're establishing a protocol, mandating UTF-8 is a great idea for text-interchange. One less variable that can result in a mismatch. UTF-8 should be the network-byte-order of character encodings. – Mike Samuel Apr 01 '11 at 21:34
  • 4
    @MikeSamuel: The `lengthInUtf8Bytes` function returns 5 for non-BMP characters as `str.length` for these returns 2. I'll write a modified version of this function to answers section. – Lauri Oherd Aug 30 '12 at 18:56
  • @LauriOherd, I think you're right. Should the output use a 5 or 6 byte encoding for supplemental CPs instead of individually encoding the UTF-16 code-units? – Mike Samuel Aug 30 '12 at 19:02
  • @MikeSamuel, Surrogate pairs consist of two code units (`'𝌆'.length == 2`) even though there’s only [one Unicode character](http://codepoints.net/U+1D306) there. The individual surrogate halves are being exposed as if they were characters: `'𝌆' == '\uD834\uDF06'`. [Source](http://mathiasbynens.be/notes/javascript-encoding) – Lauri Oherd Aug 31 '12 at 05:44
  • @Lauri, I know. By "UTF-16 code-units", I was referring to surrogates. – Mike Samuel Aug 31 '12 at 07:05
  • 4
    This solution is cool but utf8mb4 is not considered. For example, `encodeURIComponent('🍀')` is `'%F0%9F%8D%80'`. – albert Jun 22 '16 at 17:01
  • @MikeSamuel In [a previous comment](https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript#comment6264163_5515960) you said "since `unescape` does different things on different platforms". On which platform does `unescape` not follow the [standard](https://www.ecma-international.org/ecma-262/10.0/index.html#sec-unescape-string)? (The behavior of `unescape` in the current standard is the same as in the first [standard](https://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262,%201st%20edition,%20June%201997.pdf)) – T S Aug 09 '19 at 15:52
  • @TS You may be right in that there are no modern engines that diverge. `unescape('%uabcd') === '\uabcd'` which is not percent encoding. IIRC, that was an IE change from Netscape behavior to make it easier for JS on non-UTF8 pages to interoperate with IIS. See ["%u encoding"](https://www.cgisecurity.com/lib/URLEmbeddedAttacks.html) re IIS quirks. – Mike Samuel Aug 12 '19 at 17:15
  • @MikeSamuel Yes, `%uabcd` is not percent encoding (It's not even valid in an URI according to [RFC3986](https://tools.ietf.org/html/rfc3986)) - but this never happens in `unescape(encodeURIComponent(...))`. So I still believe that "`unescape(encodeURIComponent(str)).length` won't do anything useful since `unescape` does different things on different platforms." isn't correct. Where does `unescape(encodeURIComponent(str)).length` not work? (See also https://stackoverflow.com/a/619428/2770331) – T S Aug 13 '19 at 20:46
  • @TS, I think I've already acknowledged that "does different things" is false today. You seem to be talking about an old thread which I'd forgotten, so I'm not clear on what you want to do with `unescape(encodeURIComponent(str)).length`. – Mike Samuel Aug 14 '19 at 21:13
  • I just think it's not good, that the accepted, highest voted answer suggests a method (`var m = encodeURIComponent(str).match(/%[89ABab]/g); return str.length + (m ? m.length : 0);`) that doesn't work for codepoints outside the BMP (as noticed in this [comment](https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript/5515960?noredirect=1#comment16344660_5515960)) when the original question already contained a solution, that works everywhere with all codepoints (`return unescape(encodeURIComponent(str)).length`). It would be nice, if that was not only visible in the ... – T S Aug 15 '19 at 10:11
  • ... comments, but also in the answer itself - who reads this many comments? Oh and sorry, if I seem a little rude - that wasn't my intention. I'm just not a native speaker and sometimes have the habit of being a little too direct... – T S Aug 15 '19 at 10:14
  • Oh, and the quote "`unescape(encodeURIComponent(str)).length` won't do anything useful since `unescape` does different things on different platforms." came from [here](https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript/5515960?noredirect=1#comment6264163_5515960), if that was your question :-) – T S Aug 15 '19 at 10:25
  • This solution showed my 8 byte as 24.try : عباس in utf8 – Steve Moretz Sep 09 '19 at 06:25
  • @stevemoretz, I get 8 for `"\u0639\u0628\u0627\u0633"`. – Mike Samuel Sep 09 '19 at 15:10
  • I don't know sir,I'm using opera,html utf-8 page,عباس gives me 24. – Steve Moretz Sep 09 '19 at 15:19
  • In your developer console, what do you get for `encodeURIComponent("\u0639\u0628\u0627\u0633")`? I get `"%D8%B9%D8%A8%D8%A7%D8%B3"`. – Mike Samuel Sep 09 '19 at 16:08
88

Here is a much faster version, which doesn't use regular expressions, nor encodeURIComponent():

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

Here is a performance comparison.

It computes the UTF-8 length of each Unicode code point from the UTF-16 code units returned by charCodeAt() (based on Wikipedia's descriptions of UTF-8 and of UTF-16 surrogate pairs).

It follows RFC3629 (where UTF-8 characters are at most 4-bytes long).
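The same per-code-point rule can be written more directly where ES2015 string iteration is available. This is only a sketch, not the benchmarked function above: the name `utf8LengthByCodePoint` is made up, and it assumes an engine that supports `for...of` on strings and `codePointAt`, which rules out the older browsers the question is about.

```javascript
// Hypothetical ES2015 variant: walk the string by code point, not by code unit.
function utf8LengthByCodePoint(str) {
  let bytes = 0;
  for (const ch of str) {            // for...of yields whole code points
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) bytes += 1;        // U+0000..U+007F
    else if (cp <= 0x7ff) bytes += 2;  // U+0080..U+07FF
    else if (cp <= 0xffff) bytes += 3; // U+0800..U+FFFF (incl. lone surrogates)
    else bytes += 4;                   // U+10000..U+10FFFF
  }
  return bytes;
}

console.log(utf8LengthByCodePoint("\u{1F340}")); // 4
```

An unpaired surrogate is counted as 3 bytes here, which is the size of the U+FFFD replacement character that a UTF-8 encoder such as TextEncoder would emit in its place.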

user1063287
lovasoa
86

For simple UTF-8 encoding, with slightly better compatibility than TextEncoder, Blob does the trick. Won't work in very old browsers though.

new Blob(["😀"]).size; // -> 4
simap
  • 1
    This is even better than TextEncoder and needs to be actual answer. No polyfill required. – jprado Mar 04 '21 at 21:17
  • Has the same downside as the `Buffer` approach below: it won't work for both browsers and Node. So the TextEncoder solution is preferable in library code, that might be used in either place. (Though I see Node does now have experimental support for Blob: https://nodejs.org/api/all.html#all_buffer_class-blob ) – Darren Cook Jan 31 '22 at 21:19
  • Works in Node 17. Also works in Web Workers. Plus it accepts a `File`. – vhs May 08 '22 at 16:37
37

Another very simple approach using Buffer (only for NodeJS):

Buffer.byteLength(string, 'utf8')

Buffer.from(string).length
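Since `Buffer` exists only in Node, while `TextEncoder` and `Blob` may be missing in some environments, library code that has to run in both places could pick whichever is available. The wrapper below is a sketch under that assumption; the name `utf8ByteLength` and the order of the checks are illustrative choices, not something prescribed by any of the APIs.

```javascript
// Hypothetical wrapper: use the first UTF-8 length API the environment offers.
function utf8ByteLength(str) {
  if (typeof Buffer !== "undefined" && typeof Buffer.byteLength === "function") {
    return Buffer.byteLength(str, "utf8"); // Node.js
  }
  if (typeof TextEncoder !== "undefined") {
    return new TextEncoder().encode(str).length;
  }
  if (typeof Blob !== "undefined") {
    return new Blob([str]).size;
  }
  // Last resort for old engines; throws URIError on unpaired surrogates.
  return unescape(encodeURIComponent(str)).length;
}

console.log(utf8ByteLength("foo")); // 3
```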
Iván Pérez
31

This function will return the byte size of any UTF-8 string you pass to it.

function byteCount(s) {
    return encodeURI(s).split(/%..|./).length - 1;
}
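The one-liner works because `encodeURI` turns every non-ASCII byte into a `%XX` escape and leaves the remaining characters as single ASCII characters, so each separator matched by the split is exactly one byte. A more explicit sketch of the same idea follows; the name `byteCountExplained` is made up for illustration.

```javascript
// Hypothetical, more verbose equivalent of byteCount.
function byteCountExplained(s) {
  // After encodeURI the string is pure ASCII: %XX escapes plus literal characters.
  const encoded = encodeURI(s);
  // One match per byte: either a whole %XX escape or one unescaped character.
  const units = encoded.match(/%[0-9A-F]{2}|[^%]/g) || [];
  return units.length;
}

console.log(byteCountExplained("a b")); // 3, because encodeURI gives "a%20b"
```

Like `encodeURIComponent`, `encodeURI` throws a `URIError` when the string contains an unpaired surrogate.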

Source

Lauri Oherd
11

I compared some of the methods suggested here in Firefox for speed.

The string I used contained the following characters: œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤

All results are averages of 3 runs each. Times are in milliseconds. Note that all URIEncoding methods behaved similarly and had extreme results, so I only included one.

While there are some fluctuations based on the size of the string, the charCode methods (lovasoa and fuweichin) both perform similarly and the fastest overall, with fuweichin's charCode method the fastest. The Blob and TextEncoder methods performed similarly to each other. Generally the charCode methods were about 75% faster than the Blob and TextEncoder methods. The URIEncoding method was basically unacceptable.

Here are the results I got:

Size 6.4 * 10^6 bytes:

Lauri Oherd – URIEncoding:     6400000    et: 796
lovasoa – charCode:            6400000    et: 15
fuweichin – charCode2:         6400000    et: 16
simap – Blob:                  6400000    et: 26
Riccardo Galli – TextEncoder:  6400000    et: 23

Size 19.2 * 10^6 bytes: Blob does kind of a weird thing here.

Lauri Oherd – URIEncoding:     19200000    et: 2322
lovasoa – charCode:            19200000    et: 42
fuweichin – charCode2:         19200000    et: 45
simap – Blob:                  19200000    et: 169
Riccardo Galli – TextEncoder:  19200000    et: 70

Size 64 * 10^6 bytes:

Lauri Oherd – URIEncoding:     64000000    et: 12565
lovasoa – charCode:            64000000    et: 138
fuweichin – charCode2:         64000000    et: 133
simap – Blob:                  64000000    et: 231
Riccardo Galli – TextEncoder:  64000000    et: 211

Size 192 * 10^6 bytes: the URIEncoding method freezes the browser at this point.

lovasoa – charCode:            192000000    et: 754
fuweichin – charCode2:         192000000    et: 480
simap – Blob:                  192000000    et: 701
Riccardo Galli – TextEncoder:  192000000    et: 654

Size 640 * 10^6 bytes:

lovasoa – charCode:            640000000    et: 2417
fuweichin – charCode2:         640000000    et: 1602
simap – Blob:                  640000000    et: 2492
Riccardo Galli – TextEncoder:  640000000    et: 2338

Size 1280 * 10^6 bytes: Blob & TextEncoder methods are starting to hit the wall here.

lovasoa – charCode:            1280000000    et: 4780
fuweichin – charCode2:         1280000000    et: 3177
simap – Blob:                  1280000000    et: 6588
Riccardo Galli – TextEncoder:  1280000000    et: 5074

Size 1920 * 10^6 bytes:

lovasoa – charCode:            1920000000    et: 7465
fuweichin – charCode2:         1920000000    et: 4968
JavaScript error: file:///Users/xxx/Desktop/test.html, line 74: NS_ERROR_OUT_OF_MEMORY:

Here is the code:

function byteLengthURIEncoding(str) {
  return encodeURI(str).split(/%..|./).length - 1;
}

function byteLengthCharCode(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

function byteLengthCharCode2(s){
  //assuming the String is UCS-2(aka UTF-16) encoded
  var n=0;
  for(var i=0,l=s.length; i<l; i++){
    var hi=s.charCodeAt(i);
    if(hi<0x0080){ //[0x0000, 0x007F]
      n+=1;
    }else if(hi<0x0800){ //[0x0080, 0x07FF]
      n+=2;
    }else if(hi<0xD800){ //[0x0800, 0xD7FF]
      n+=3;
    }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
      var lo=s.charCodeAt(++i);
      if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
        n+=4;
      }else{
        throw new Error("UCS-2 String malformed");
      }
    }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
      throw new Error("UCS-2 String malformed");
    }else{ //[0xE000, 0xFFFF]
      n+=3;
    }
  }
  return n;
}

function byteLengthBlob(str) {
  return new Blob([str]).size;
}

function byteLengthTE(str) {
  return (new TextEncoder().encode(str)).length;
}

var sample = "œ´®†¥¨ˆøπ¬˚∆˙©ƒ∂ßåΩ≈ç√∫˜µ≤i";
var string = "";

// Adjust multiplier to change length of string.
let mult = 1000000;

for (var i = 0; i < mult; i++) {
  string += sample;
}

let t0;

try {
  t0 = Date.now();
  console.log("Lauri Oherd – URIEncoding:   " + byteLengthURIEncoding(string) + "    et: " + (Date.now() - t0));
} catch(e) {}

t0 = Date.now();
console.log("lovasoa – charCode:            " + byteLengthCharCode(string) + "    et: " + (Date.now() - t0));

t0 = Date.now();
console.log("fuweichin – charCode2:         " + byteLengthCharCode2(string) + "    et: " + (Date.now() - t0));

t0 = Date.now();
console.log("simap – Blob:                  " + byteLengthBlob(string) + "    et: " + (Date.now() - t0));

t0 = Date.now();
console.log("Riccardo Galli – TextEncoder:  " + byteLengthTE(string) + "    et: " + (Date.now() - t0));
Keith
KevinHJ
7

Took me a while to find a solution for React Native so I'll put it here:

First install the buffer package:

npm install --save buffer

Then use the Node method:

const { Buffer } = require('buffer');
const length = Buffer.byteLength(string, 'utf-8');
laurent
5

Actually, I figured out what's wrong. For the code to work the page <head> should have this tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Or, as suggested in the comments, if the server declares the charset in the HTTP Content-Type header, it should work as well.

Then results from different browsers are consistent.

Here is an example:

<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
  <title>mini string length test</title>
</head>
<body>

<script type="text/javascript">
document.write('<div style="font-size:100px">' 
    + (unescape(encodeURIComponent("ЭЭХ! Naïve?")).length) + '</div>'
  );
</script>
</body>
</html>

Note: I suspect that specifying any (accurate) encoding would fix the encoding problem. It is just a coincidence that I need UTF-8.
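To see why `unescape(encodeURIComponent(str)).length` counts UTF-8 bytes, it can help to look at the intermediate string: `encodeURIComponent` writes each UTF-8 byte as `%XX`, and `unescape` turns each `%XX` back into one character whose code is that byte. The helper below is a sketch for inspection only. Its name is made up, and `unescape` is a legacy function, so this assumes an engine that still provides it.

```javascript
// Hypothetical helper: expose the UTF-8 bytes behind the unescape/encodeURIComponent trick.
function utf8Bytes(str) {
  // "Binary string": one character per UTF-8 byte, char codes 0..255.
  const binary = unescape(encodeURIComponent(str));
  return Array.from(binary, (ch) => ch.charCodeAt(0));
}

console.log(utf8Bytes("\u00EF")); // the two bytes 195 and 175 (0xC3 0xAF) for "ï"
```

Printing the bytes (or the text itself) in each browser is one way to check whether the browsers disagree about the length or about the text they were given.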

Alexander Gladysh
  • 2
    The `unescape` JavaScript function [should not](http://msdn.microsoft.com/en-us/library/dz4x90hk(v=vs.94).aspx) be used to decode Uniform Resource Identifiers (URI). – Lauri Oherd Aug 31 '12 at 05:58
  • 3
    @LauriOherd `unescape` should indeed never be used to decode URIs. However, to convert text to UTF-8 it works [fine](https://stackoverflow.com/questions/2219526/how-many-bytes-in-a-javascript-string#comment101351898_2858850) – T S Aug 09 '19 at 23:45
  • `unescape(encodeURIComponent(...)).length` always calculates the correct length with or without `meta http-equiv ... utf8`. Without an encoding specification some browsers might simply had a *different text* (after encoding the bytes of the document into actual html text) whose length they calculated. One could test this easily, by printing not only the length, but also the text itself. – T S Aug 09 '19 at 23:53
  • @LauriOherd Yes, and it's not used to decode a URI in this example. – Finesse Feb 18 '21 at 03:37
5

In NodeJS, Buffer.byteLength is a method specifically for this purpose:

let strLengthInBytes = Buffer.byteLength(str); // str is UTF-8

Note that by default the method assumes the string is in UTF-8 encoding. If a different encoding is required, pass it as the second argument.
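A short sketch of passing the encoding explicitly, using encoding names that Node's Buffer accepts (`'utf8'`, `'utf16le'`, `'latin1'`). The variable names are illustrative only.

```javascript
// Node.js only: Buffer is a global there.
const accented = "\u0161\u010D"; // "šč"

const utf8Length = Buffer.byteLength(accented);           // default encoding is 'utf8'
const utf16Length = Buffer.byteLength("foo", "utf16le");   // 2 bytes per code unit
const latin1Length = Buffer.byteLength("foo", "latin1");   // 1 byte per character

console.log(utf8Length, utf16Length, latin1Length); // 4 6 3
```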

Boaz
  • Is it possible to calculate `strLengthInBytes` just by knowing the 'count' of characters within the string? ie `var text = "Hello World!"; var text_length = text.length; // pass text_length as argument to some method?`. And, just for reference, re `Buffer` - I just came across [this answer](https://stackoverflow.com/a/52254083) that discusses `new Blob(['test string']).size` and, in node, `Buffer.from('test string').length`. Maybe these will help some people too? – user1063287 Jul 15 '19 at 09:33
  • 1
    @user1063287 The problem is the number of characters is not always equivalent to the number of bytes. For example, the common UTF-8 encoding is a variable width encoding, in which a single character may be 1 byte to 4 bytes in size. That’s why a special method is needed as well as the encoding used. – Boaz Jul 15 '19 at 10:17
  • For example, a UTF-8 string with 4 characters, may at least be 4 bytes "long", if each character is just 1 byte; and at most 16 bytes "long" if each character is 4 bytes. Note in either case the _characters count_ is still 4 and is therefore an unreliable measure for the _bytes length_. – Boaz Jun 08 '20 at 16:50
4

Here is an independent and efficient method to count UTF-8 bytes of a string.

//count UTF-8 bytes of a string
function byteLengthOf(s){
 //assuming the String is UCS-2(aka UTF-16) encoded
 var n=0;
 for(var i=0,l=s.length; i<l; i++){
  var hi=s.charCodeAt(i);
  if(hi<0x0080){ //[0x0000, 0x007F]
   n+=1;
  }else if(hi<0x0800){ //[0x0080, 0x07FF]
   n+=2;
  }else if(hi<0xD800){ //[0x0800, 0xD7FF]
   n+=3;
  }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
   var lo=s.charCodeAt(++i);
   if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
    n+=4;
   }else{
    throw new Error("UCS-2 String malformed");
   }
  }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
   throw new Error("UCS-2 String malformed");
  }else{ //[0xE000, 0xFFFF]
   n+=3;
  }
 }
 return n;
}

var s="\u0000\u007F\u07FF\uD7FF\uDBFF\uDFFF\uFFFF";
console.log("expect byteLengthOf(s) to be 14, actually it is %s.",byteLengthOf(s));

Note that the method throws an error if the input string is malformed (that is, if it contains an unpaired surrogate).

fuweichin
2

Based on the following benchmarks, this appears to be the fastest choice that works on all platforms:

I have created the following library that implements the above:

import stringByteLength from 'string-byte-length'

stringByteLength('test') // 4
stringByteLength(' ') // 1
stringByteLength('\0') // 1
stringByteLength('±') // 2
stringByteLength('★') // 3
stringByteLength('🦄') // 4
ehmicky
1

This would work for BMP and SIP/SMP characters.

    String.prototype.lengthInUtf8 = function() {
        // ASCII characters take one byte each.
        var ascii = this.match(/[\u0000-\u007f]/g);
        // encodeURI emits one "%XX" escape per UTF-8 byte of every remaining character.
        var escapes = encodeURI(this.replace(/[\u0000-\u007f]/g, '')).match(/%/g);
        return (ascii ? ascii.length : 0) + (escapes ? escapes.length : 0);
    }

    'test'.lengthInUtf8();
    // returns 4
    '\u{2f894}'.lengthInUtf8();
    // returns 4
    'سلام علیکم'.lengthInUtf8();
    // returns 19, each Arabic/Persian alphabet character takes 2 bytes. 
    '你好,JavaScript 世界'.lengthInUtf8();
    // returns 26, each Chinese character/punctuation takes 3 bytes. 
chrislau
0

You can try this:

function getLengthInBytes(str) {
  var b = str.match(/[^\x00-\xff]/g);
  return (str.length + (!b ? 0: b.length)); 
}

It works for me.

anh tran
  • returns 1 for "â" in chrome – Rick Mar 29 '17 at 12:09
  • the first issue could be fixed by changing \xff to \x7f, but that doesn't fix the fact that codepoints between 0x800-0xFFFF will be reported as taking 2 bytes, when they take 3. – Rick Mar 29 '17 at 12:16
0

Ran into this question: "How to emulate mb_strlen in javascript with strings containing HTML",

where the string was not a good match for earlier answers.

I got the expected length of 8 here:

const str = 'X&nbsp;&#34;FUEL&#34;'
const div = document.createElement("div");
div.innerHTML = str
console.log(div.textContent.length)
mplungjan
0

I'm checking the length with :

const str = "as%20"
const len = new URL(str.replace(/%[A-F0-9]{2}/g, "..."), "https:$").pathname.replace(/%[A-F0-9]{2}/g, "-").length - 1
console.log(len) // 13

I use this when I (try to) verify that a directory name is less than 180 characters.

0

sizeInBytes = Buffer.from(data).length

Example:

let data = 'šč'; // data with utf-8 characters
console.log( data.length ); // 2
console.log( Buffer.from(data).length ); // 4
Zoran