
I'm reading the JavaScript source code of a fellow programmer, and I'm wondering why the coder has used a complex function to calculate the string length instead of just using the built-in `.length` property.

Here is the original script. This is the function in question:

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}
Webwoman
  • I believe there may be something happening there to account for potential differences in character count vs. bytes consumed. I'm not entirely sure and would ask said colleague, but I do believe there are some characters that are effectively combinations of multiple characters, which would cause the string character count (string.length) and the outcome of this function to differ. Again, I'm not sure and would still ask the colleague, but that's my best guess. – CodyKnapp Mar 29 '19 at 23:30
  • The function calculates the length in bytes. `.length` gives the length in characters. Some characters may be multi-byte. – lurker Mar 29 '19 at 23:30
  • The code comment starts off by misleading you. A string is a counted sequence of UTF-16 code units. It does correctly compute the intended byte length, though. Maybe you can get the code comment improved. – Tom Blodget Mar 30 '19 at 15:56

2 Answers


JavaScript represents strings as UTF-16 sequences. The code you posted figures out how long a JavaScript string would be if it were represented as a sequence of UTF-8 bytes. UTF-8 and UTF-16 are two different ways of representing Unicode, and they're similar but not the same.

Characters in the old ASCII range (U+0000 through U+007F) are one byte in UTF-8. The roughly 1,900 code points from U+0080 through U+07FF (which covers the rest of Latin-1, among many others) take two bytes each, the remaining code points up to U+FFFF (around 61,000 of them) take three bytes each, and everything above U+FFFF takes four bytes.

UTF-16 represents the code points above U+FFFF (the four-byte ones in UTF-8) with two 16-bit code units, called a "surrogate pair".
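
To make that concrete, here is a small sketch comparing `.length` with the UTF-8 byte count. It assumes a modern environment (current browsers, or Node.js 11 and later) where TextEncoder is available, and the helper name utf8Bytes is just for illustration:

// UTF-8 byte count of a string, via the built-in TextEncoder
function utf8Bytes(str) {
  return new TextEncoder().encode(str).length;
}

console.log('a'.length, utf8Bytes('a'));   // 1, 1  (ASCII, one byte)
console.log('é'.length, utf8Bytes('é'));   // 1, 2  (U+00E9, two bytes)
console.log('€'.length, utf8Bytes('€'));   // 1, 3  (U+20AC, three bytes)
console.log('😀'.length, utf8Bytes('😀')); // 2, 4  (U+1F600, a surrogate pair, four bytes)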

Note that, depending on what one means by "character", things are more complicated still: there are Unicode "characters" that act more like diacritical marks, combining with a base character and possibly other such modifiers.
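
For example, "é" can be written either as the single precomposed code point U+00E9 or as a plain "e" followed by the combining acute accent U+0301; the two render the same but measure differently:

var precomposed = '\u00e9'; // é as one code point
var combining = 'e\u0301';  // e followed by a combining acute accent

console.log(precomposed.length, combining.length); // 1, 2
// In UTF-8 the precomposed form is 2 bytes, the combining form is 3 (1 + 2)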

The bottom line is that .length gives you only the number of UTF-16 code units in a string, not the number of characters and not the number of bytes. If the string contains characters that need a surrogate pair, the number of actual characters is less than .length. And, going by the name of your function, the UTF-8 byte count can fall on either side of the UTF-16 byte count: mostly-ASCII text is smaller in UTF-8, while text made up largely of characters from U+0800 to U+FFFF (most CJK text, for example) is larger, since those characters take three bytes in UTF-8 but only two in UTF-16.
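
If all you need is the UTF-8 byte count, recent environments also have built-ins for it, so the hand-rolled loop is mainly useful for older runtimes. A quick sketch, assuming Node.js for the Buffer line and any environment with TextEncoder for the other:

// Node.js: Buffer.byteLength counts bytes in the given encoding
console.log(Buffer.byteLength('test 😀', 'utf8'));       // 9 (5 ASCII bytes + 4 for the emoji)

// Browsers and Node.js 11+: encode to UTF-8 and count the bytes
console.log(new TextEncoder().encode('test 😀').length); // 9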

Pointy

That code is really just walking the string and adding up how many bytes each character would take if the string were encoded as UTF-8 (it doesn't validate anything). Run this example and it should give you a good idea of how multi-byte characters get counted differently from what `.length` reports:

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length; // start with 1 byte per UTF-16 code unit
  for (var i = str.length - 1; i >= 0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;           // 2-byte UTF-8 sequence: add 1 more
    else if (code > 0x7ff && code <= 0xffff) s += 2; // 3-byte UTF-8 sequence: add 2 more
    // trail surrogate: the pair is 4 UTF-8 bytes in total
    // (2 from the initial str.length + 2 added just above), so skip the lead surrogate
    if (code >= 0xDC00 && code <= 0xDFFF) i--;
  }
  return s;
}

console.log('byteLength should be 4: ', byteLength('test'));
console.log('.length should be 4: ', 'test'.length);

console.log('byteLength should be 5: ', byteLength('test '));
console.log('.length should be 5: ', 'test '.length);

console.log('byteLength should be 3: ', byteLength('Ḇ'));
console.log('.length should be 1: ', 'Ḇ'.length);
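
If you add a character outside the Basic Multilingual Plane, the surrogate-pair branch (the `i--` plus the extra `s += 2`) kicks in as well; for example, with the emoji '😀' (U+1F600):

console.log('byteLength should be 4: ', byteLength('😀')); // 4 UTF-8 bytes
console.log('.length should be 2: ', '😀'.length);         // 2 UTF-16 code units (a surrogate pair)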

Here is a good article that provides some good/bad test data if you want to play around further:

Really Good, Bad UTF-8 example test data

mwilson
  • There is no such thing as a UTF-8 character and there really is no such thing as a non-UTF-8 character. UTF-8 is a character encoding for the Unicode character set. It can encode all characters in the Unicode character set. A character encoding is a mapping between codepoints and code units (which are often then serialized to byte sequences). The idea of a "character" is at least as large as codepoint, if not larger. See [grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). – Tom Blodget Mar 30 '19 at 15:31
  • Why are we adding `s += 2`? – Syed Shahjahan Jun 22 '20 at 10:23