How to calculate byte length containing UTF8 characters using javascript?

Question

I have a textbox in which the user can enter ASCII characters, other UTF-8 characters, or a combination of both. Is there any API in JavaScript that can calculate the length in bytes of the string entered in the textbox?

For example, if I enter only ASCII characters, say mystring, the length would be calculated as 8. But when non-ASCII UTF-8 characters are entered, each character can take 2, 3, or 4 bytes.

Let's say the characters entered are: i ♥ u , the length in bytes is 5.

The textbox accepts a maximum length of 31 characters. But if UTF-8 characters are entered, it will not accept the string: i ♥ u i ♥ u i ♥ u i ♥ u i ♥ u , whose length is 30.

Can we restrict the user to a length of no more than 31, even when UTF-8 characters are entered?

score 27 · Answer 1 · edited Apr 28 '18 at 10:54

As of 2018, the most compatible and reliable way of doing this seems to be the Blob API.

new Blob([str]).size

It is even supported in IE10, if anyone uses that anymore.
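As a rough sketch of how this could be applied to the 31-byte limit from the question: the names `MAX_BYTES`, `utf8ByteLength` and `fitsInLimit` are made up for illustration, and `Blob` is assumed to be available as a global (as it is in browsers and recent Node.js versions).

```javascript
// Hypothetical limit, taken from the question's 31-byte field.
const MAX_BYTES = 31;

// Blob stores string parts as UTF-8, so .size is the UTF-8 byte length.
function utf8ByteLength(str) {
  return new Blob([str]).size;
}

// Hypothetical helper: true when the string fits in the byte budget.
function fitsInLimit(str, maxBytes = MAX_BYTES) {
  return utf8ByteLength(str) <= maxBytes;
}

console.log(utf8ByteLength("mystring"));   // 8
console.log(utf8ByteLength("i \u2665 u")); // 7 ("i ♥ u": the heart takes 3 bytes)
```

A check like `fitsInLimit(textbox.value)` could then run in an input handler, where `textbox` stands for whatever element holds the text.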

edited Apr 28 '18 at 10:54

ayanami


answered Apr 01 '18 at 15:39

recursive


score 5 · Answer 2 · answered Feb 06 '18 at 11:58

The experimental TextEncoder API can be used for this but is not supported by Internet Explorer or Safari:

(new TextEncoder()).encode("i ♥ u i ♥ u i ♥ u i ♥ u i ♥ u").length;
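To actually enforce a byte limit, as the question asks, one possible approach is to trim the input to the longest prefix that still fits. The sketch below assumes `TextEncoder` is available; `truncateToBytes` is a hypothetical helper name, not part of any API.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: longest prefix of `str` whose UTF-8 encoding
// fits in `maxBytes`, cut only on code point boundaries.
function truncateToBytes(str, maxBytes) {
  let used = 0;
  let out = "";
  for (const ch of str) { // for...of walks code points, not UTF-16 units
    const size = encoder.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

const input = "i \u2665 u ".repeat(5); // "i ♥ u " five times
console.log(encoder.encode(input).length);                      // 40
console.log(encoder.encode(truncateToBytes(input, 31)).length); // 31
```

This never splits a surrogate pair, but it can still cut through a multi-code-point sequence such as an emoji with a modifier, so treat it as a starting point.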

Another alternative is to URI-encode the string and count characters and %-encoded escape sequences, as in this library:

~-encodeURI("i ♥ u i ♥ u i ♥ u i ♥ u i ♥ u").split(/%..|./).length

The GitHub page has a compatibility list, which unfortunately lists IE9 but not IE10.
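The one-liner is dense, so here is an expanded sketch of the same idea. The function name `byteLengthViaEncodeURI` is made up for illustration. It relies on `encodeURI` emitting one `%XX` escape per UTF-8 byte for every character it escapes, while the characters it leaves untouched are all single-byte ASCII.

```javascript
// Hypothetical name; an expanded form of the encodeURI one-liner.
function byteLengthViaEncodeURI(str) {
  const encoded = encodeURI(str); // throws URIError on a lone surrogate
  // Each match is either one %XX escape (one byte) or one unescaped ASCII character.
  const units = encoded.match(/%[0-9A-F]{2}|./g);
  return units ? units.length : 0; // match() returns null for the empty string
}

console.log(encodeURI("i \u2665 u"));              // i%20%E2%99%A5%20u
console.log(byteLengthViaEncodeURI("i \u2665 u")); // 7
```

A literal `%` in the input is safe because `encodeURI` turns it into `%25`, which is counted as one byte.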

Since I cannot comment yet, I'll also note here that the solution in the accepted answer does not work for code points consisting of multiple UTF-16 code units.

score 4 · Accepted Answer · answered Sep 23 '14 at 11:52

Counting UTF-8 bytes comes up quite a bit in JavaScript. With a bit of looking around you'll find a number of libraries (here's one example: https://github.com/mathiasbynens/utf8.js) that can help. I also found a thread (https://gist.github.com/mathiasbynens/1010324) full of solutions specifically for UTF-8 byte counts.

Here is the smallest and most accurate function from that thread:

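function countUtf8Bytes(s){
    var b = 0, i = 0, c
    // charCodeAt yields UTF-16 code units: 1 byte below 0x80, 2 below 0x800, otherwise 3.
    // The loop ends when charCodeAt returns NaN (past the end) or 0 (a NUL character).
    for(;c=s.charCodeAt(i++);b+=c>>11?3:c>>7?2:1);
    return b
}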

Note: I rearranged it a bit so that the signature is easier to read. However, it's still a very compact function that might be hard for some to understand.

You can check its results with this tool: https://mothereff.in/byte-counter

One correction to your original post: the example string you provided, i ♥ u, is actually 7 bytes. This function counts it correctly.
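Because `charCodeAt` works on UTF-16 code units, the compact function counts each half of a surrogate pair as 3 bytes. A more explicit variant that walks code points could look like the sketch below. The name `countUtf8BytesByCodePoint` is hypothetical, and it assumes an ES2015+ environment for `for...of` and `codePointAt`.

```javascript
// Hypothetical variant that iterates code points instead of UTF-16 code units.
function countUtf8BytesByCodePoint(s) {
  let bytes = 0;
  for (const ch of s) { // string iteration yields whole code points
    const cp = ch.codePointAt(0);
    if (cp < 0x80) bytes += 1;         // ASCII
    else if (cp < 0x800) bytes += 2;
    else if (cp < 0x10000) bytes += 3; // rest of the BMP (lone surrogates land here too)
    else bytes += 4;                   // code points outside the BMP, e.g. most emoji
  }
  return bytes;
}

console.log(countUtf8BytesByCodePoint("i \u2665 u")); // 7
console.log(countUtf8BytesByCodePoint("\u{1F600}"));  // 4
```

Unlike the compact loop, this version also keeps counting past an embedded NUL character.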

answered Sep 23 '14 at 11:52

klyd


This code seems to give 6 for "", but the correct result is 4 I think. – recursive Apr 01 '18 at 15:37

@recursive That is because charCodeAt only returns values from 0 to 65535. Code points outside the BMP (for example, emoji) are represented as UTF-16 surrogate pairs. See https://stackoverflow.com/questions/48419167/how-to-convert-one-emoji-character-to-unicode-codepoint-number-in-javascript for details. – Roy Aug 28 '19 at 07:06
https://stackoverflow.com/questions/5728045/c-most-efficient-way-to-determine-how-many-bytes-will-be-needed-for-a-utf-16-st – hippietrail Nov 08 '19 at 14:21
