Extract substring by utf-8 byte positions

Question

I have a string, plus a start and a length with which to extract a substring. Both values (start and length) are byte offsets into the original UTF-8 string.

However, there is a problem:

The start and length are in bytes, so I cannot use "substring". The UTF8 string contains several multi-byte characters. Is there a hyper-efficient way of doing this? (I don't need to decode the bytes...)

Example: var orig = '你好吗？'

The start and length might be 3,3 to extract the second character (好). I'm looking for

var result = orig.substringBytes(3,3);

Help!

Update #1 In C/C++ I would just cast it to a byte array, but not sure if there is an equivalent in javascript. BTW, yes we could parse it into a byte array and parse it back to a string, but it seems that there should be a quick way to cut it at the right place. Imagine that 'orig' is 1000000 characters, and s = 6 bytes and l = 3 bytes.

Update #2 Thanks to zerkms's helpful redirection, I ended up with the following, which does NOT work right: it works for multi-byte characters but is messed up for single-byte ones.

function substrBytes(str, start, length)
{
    var ch, startIx = 0, endIx = 0, re = '';
    for (var i = 0; i < str.length; i++)
    {
        startIx = endIx++;

        ch = str.charCodeAt(i);
        do {
            ch = ch >> 8;   // a better way may exist to measure ch len
            endIx++;
        }
        while (ch);

        if (endIx > start + length)
        {
            return re;
        }
        else if (startIx >= start)
        {
            re += str[i];
        }
    }
}

Update #3 I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The codepoint is the same for UTF8 and UTF16, but the number of bytes taken up on encoding depends on the encoding!!! So this is not the right way to do this.
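
To make the mismatch described in Update #3 concrete, here is a minimal sketch. `bytesByShift` mirrors the shifting loop from Update #2, and `utf8BytesForCodePoint` applies the standard UTF-8 length ranges. Both function names are made up for this illustration and are not part of the original code.

```javascript
// Counts bytes the way the shifting loop does: one byte per 8 bits of the
// UTF-16 code unit. This measures the UTF-16 value, not the UTF-8 encoding.
function bytesByShift(code) {
    var count = 0;
    do {
        code = code >> 8;
        count++;
    } while (code);
    return count;
}

// Number of bytes UTF-8 needs for a Unicode code point (hypothetical helper).
function utf8BytesForCodePoint(cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;     // e.g. "ü"
    if (cp < 0x10000) return 3;   // rest of the BMP, e.g. "好"
    return 4;                     // supplementary planes, e.g. emoji
}

var code = '好'.charCodeAt(0);             // 0x597D
console.log(bytesByShift(code));           // 2
console.log(utf8BytesForCodePoint(code));  // 3
```

The shift count happens to match UTF-8 only for ASCII; for anything else the two encodings diverge.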

The start and length for `substr` are in character, not bytes. — nhahtdh, Jun 26 '12 at 03:45
@zerkms - I found that too, though I think that decoding the whole string to bytes, picking off the substring and going back would be really inefficient. What if there are 10000000 characters and I want bytes 6-12? Seems that converting the whole string would be a terrible idea. — tofutim, Jun 26 '12 at 03:48
updated my answer to make the code compatible with UTF-8 input. now it does exactly what you ask for and does not rely on `Buffer()` — Kaii, Jun 26 '12 at 21:39
PS: if you can, change the input format of your "start" and "length" parameters to characters. This will really increase performance as JS is not really capable of handling utf-8 strings on byte level. (as explained, all input is converted to utf-16 internally) — Kaii, Jun 26 '12 at 23:19
@Kaii Indeed, that would be ideal. Unfortunately the output comes from SQLite and is rather fixed at this point. — tofutim, Jun 27 '12 at 01:24
then, if all comes from sqlite, you could use SQLites `SUBSTR()` to deliver the original string *and* the sub-string you need. like `SELECT mystring, SUBSTR(mystring, start, length) AS mysubstring FROM mytable` — Kaii, Jun 27 '12 at 09:54

Kaii · Accepted Answer · 2022-07-29T13:26:22.447

I had a fun time fiddling with this. Hope this helps.

Because Javascript does not allow direct byte access on a string, the only way to find the start position is a forward scan.

Quoting Update #3 from the question: "I don't think shifting the char code really works. I'm reading two bytes when the correct answer is three... somehow I always forget this. The codepoint is the same for UTF8 and UTF16, but the number of bytes taken up on encoding depends on the encoding!!! So this is not the right way to do this."

This is not correct. Actually, there is no UTF-8 string in JavaScript. According to the ECMA-262 (ECMAScript) specification, all strings, regardless of the input encoding, must be internally stored as UTF-16 ("[sequence of] 16-bit unsigned integers").

Considering this, the 8-bit shift is correct (but unnecessary).

What is wrong is the assumption that your character is stored as a 3-byte sequence.
In fact, every code unit in a JS (ECMA-262) string is 16 bits (2 bytes) long.
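
A quick way to see the difference between UTF-16 code units and UTF-8 bytes is sketched below. It assumes `TextEncoder` is available (modern browsers and recent Node.js versions), which was not an option when this answer was written.

```javascript
var s = '你好吗？';
var utf8 = new TextEncoder().encode(s); // Uint8Array holding the UTF-8 bytes

console.log(s.length);     // 4  (UTF-16 code units)
console.log(utf8.length);  // 12 (UTF-8 bytes, 3 per character here)
```

So a byte offset of 3 in the UTF-8 data corresponds to index 1 in the JavaScript string, which is why a conversion step is needed.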

This can be worked around by converting the multibyte characters to utf-8 manually, as shown in the code below.

UPDATE This solution doesn't handle codepoints >= U+10000 including emoji. See APerson's Answer for a more complete solution.

See the details explained in my example code:

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
    *       "ü" is 2 bytes, 
    *  and  "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared here so they do not leak as globals

    // scan string forward to find index of first character
    // (convert start position in byte to start position in characters)

    for (bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected 
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    for (n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗？';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"

Note that this answer doesn't handle code points U+10000 or above - including emoji. See my answer. — APerson, Jul 24 '22 at 05:29

score 8 · Answer 2 · edited Aug 01 '22 at 09:11

@Kaii's answer is almost correct, but there is a bug in it: it fails to handle characters whose Unicode code points are from 128 to 255. Here is the revised version (just change 256 to 128):

function encode_utf8( s )
{
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

   /* this function scans a multibyte string and returns a substring. 
    * arguments are start position and length, both defined in bytes.
    * 
    * this is tricky, because javascript only allows character level 
    * and not byte level access on strings. Also, all strings are stored
    * in utf-16 internally - so we need to convert characters to utf-8
    * to detect their length in utf-8 encoding.
    *
    * the startInBytes and lengthInBytes parameters are based on byte 
    * positions in a utf-8 encoded string.
    * in utf-8, for example: 
    *       "a" is 1 byte, 
    *       "ü" is 2 bytes, 
    *  and  "你" is 3 bytes.
    *
    * NOTE:
    * according to ECMAScript 262 all strings are stored as a sequence
    * of 16-bit characters. so we need an encode_utf8() function to safely
    * detect the length our character would have in a utf8 representation.
    * 
    * http://www.ecma-international.org/publications/files/ecma-st/ECMA-262.pdf
    * see "4.3.16 String Value":
    * > Although each value usually represents a single 16-bit unit of 
    * > UTF-16 text, the language does not place any restrictions or 
    * > requirements on the values except that they be 16-bit unsigned 
    * > integers.
    */

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared here so they do not leak as globals

    // scan string forward to find index of first character
    // (convert start position in byte to start position in characters)

    for (bytePos = 0; bytePos < startInBytes; startInChars++) {

        // get numeric code of character (is >= 128 for multibyte character)
        // and increase "bytePos" for each byte of the character sequence

        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str[startInChars]).length;
    }

    // now that we have the position of the starting character,
    // we can build the resulting substring

    // as we don't know the end position in chars yet, we start with a mix of
    // chars and bytes. we decrease "end" by the byte count of each selected 
    // character to end up in the right position
    end = startInChars + lengthInBytes - 1;

    for (n = startInChars; startInChars <= end; n++) {
        // get numeric code of character (is >= 128 for multibyte character)
        // and decrease "end" for each byte of the character sequence
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str[n]).length;

        resultStr += str[n];
    }

    return resultStr;
}

var orig = 'abc你好吗？©';

alert('res: ' + substr_utf8_bytes(orig, 0, 2)); // alerts: "ab"
alert('res: ' + substr_utf8_bytes(orig, 2, 1)); // alerts: "c"
alert('res: ' + substr_utf8_bytes(orig, 3, 3)); // alerts: "你"
alert('res: ' + substr_utf8_bytes(orig, 6, 6)); // alerts: "好吗"
alert('res: ' + substr_utf8_bytes(orig, 15, 2)); // alerts: "©"

By the way, it is a bug fix, and it SHOULD be useful for the ones who have the same problem.

i took this into credit and edited my answer. thanks for your sharp eyes — Kaii, Nov 21 '12 at 23:53

tofutim · score 3 · Answer 3 · 2012-06-26T16:46:44.537

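// Node.js only: Buffer is not available in browsers.
function substrBytes(str, start, length)
{
    var buf = Buffer.from(str, 'utf8');   // new Buffer(str) is deprecated
    return buf.slice(start, start + length).toString('utf8');
}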
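
For environments without `Buffer`, the same encode, slice, decode idea can be sketched with `TextEncoder` and `TextDecoder`, assuming both are available (modern browsers and recent Node.js). The function name `substrBytesPortable` is made up for this example.

```javascript
// Hypothetical portable variant of substrBytes; assumes TextEncoder and
// TextDecoder exist in the runtime.
function substrBytesPortable(str, start, length) {
    var bytes = new TextEncoder().encode(str);         // whole string to UTF-8
    var part = bytes.subarray(start, start + length);  // a view, not a copy
    return new TextDecoder('utf-8').decode(part);      // back to a JS string
}

console.log(substrBytesPortable('你好吗？', 3, 3)); // 好
```

Like the `Buffer` version, this encodes the entire string first, so it does not address the concern about very long inputs. If the byte range cuts through a character, the decoder emits replacement characters (U+FFFD) instead of throwing.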

edited Jun 26 '12 at 16:46

answered Jun 26 '12 at 09:56


i tried this, but i have no Buffer() object. which framework did you use? – Kaii Jun 26 '12 at 19:00
This doesn't work for me in Node.js. Returns a bunch of question mark characters. Regular substr works well. – Gavin Jul 02 '14 at 14:55

score 1 · Answer 4 · answered Mar 11 '14 at 12:06

For IE users, the code in the answer above will output undefined, because IE does not support str[n]; in other words, you cannot index a string like an array. You need to replace str[n] with str.charAt(n). The code should be:

function encode_utf8( s ) {
  return unescape( encodeURIComponent( s ) );
}

function substr_utf8_bytes(str, startInBytes, lengthInBytes) {

    var resultStr = '';
    var startInChars = 0;
    var bytePos, ch, end, n; // declared here so they do not leak as globals

    for (bytePos = 0; bytePos < startInBytes; startInChars++) {
        ch = str.charCodeAt(startInChars);
        bytePos += (ch < 128) ? 1 : encode_utf8(str.charAt(startInChars)).length;
    }

    end = startInChars + lengthInBytes - 1;

    for (n = startInChars; startInChars <= end; n++) {
        ch = str.charCodeAt(n);
        end -= (ch < 128) ? 1 : encode_utf8(str.charAt(n)).length;

        resultStr += str.charAt(n);
    }

    return resultStr;
}

score 1 · Answer 5 · answered Sep 07 '17 at 03:37

Maybe use this to count bytes; an example follows. It counts the character 你 as 2 bytes, instead of the 3 bytes that @Kaii's function gives:

jQuery.byteLength = function(target) {
    try {
        var i = 0;
        var length = 0;
        var count = 0;
        var character = '';
        //
        target = jQuery.castString(target);
        length = target.length;
        //
        for (i = 0; i < length; i++) {
            // extract one character and convert it to its Unicode value
            character = target.charCodeAt(i);
            //
            // half-width Unicode ranges: 0x0 - 0x80, 0xf8f0, 0xff61 - 0xff9f,
            // 0xf8f1 - 0xf8f3
            if ((character >= 0x0 && character < 0x81)
                    || (character == 0xf8f0)
                    || (character > 0xff60 && character < 0xffa0)
                    || (character > 0xf8f0 && character < 0xf8f4)) {
                // 1-byte character
                count += 1;
            } else {
                // 2-byte character
                count += 2;
            }
        }
        //
        return (count);
    } catch (e) {
        jQuery.showErrorDetail(e, 'byteLength');
        return (0);
    }
};

for (var j = 1, len = value.length; j <= len; j++) {
    var slice = value.slice(0, j);
    var slength = $.byteLength(slice);
    if ( slength == 106 ) {
        $(this).val(slice);
        break;
    }
}

APerson · Answer 6 · 2022-08-01T04:11:21.883

Kaii's answer is solid, except that it doesn't handle code points at or above U+10000 (like emoji). Those are stored as surrogate pairs, and passing one half of a pair to encodeURIComponent throws an error. I copied it and changed some stuff:

// return how many bytes the UTF-16 code unit `s` would be, if represented in utf8
function utf8_len(s) {
    var charCode = s.charCodeAt(0);
    if (charCode < 128) return 1;
    if (charCode < 2048) return 2;
    if ((55296 <= charCode) && (charCode <= 56319)) return 4; // UTF-16 high surrogate
    if ((56320 <= charCode) && (charCode <= 57343)) return 0; // UTF-16 low surrogate
    if (charCode < 65536) return 3;
    throw 'Bad char';
}

// Returns the substring of `str` starting at UTF-8 byte index `startInBytes`,
// that extends for `lengthInBytes` UTF-8 bytes. May misbehave if the
// specified string does NOT start and end on character boundaries.
function substr_utf8_bytes(str, startInBytes, lengthInBytes) {
    var currCharIdx = 0;

    // Scan through the string, looking for the start of the substring
    var bytePos = 0;
    while (bytePos < startInBytes) {
        var utf8Len = utf8_len(str.charAt(currCharIdx));
        bytePos += utf8Len;
        currCharIdx++;

        // Make sure to include low surrogate
        if ((utf8Len == 4) && (bytePos == startInBytes)) {
            currCharIdx++;
        }
    }

    // We've found the substring; copy it to resultStr character by character
    var resultStr = '';
    var currLengthInBytes = 0;
    while (currLengthInBytes < lengthInBytes) {
        var utf8Len = utf8_len(str.charAt(currCharIdx));
        currLengthInBytes += utf8Len;
        resultStr += str[currCharIdx];
        currCharIdx++;

        // Make sure to include low surrogate
        if ((utf8Len == 4) && (currLengthInBytes == lengthInBytes)) {
            resultStr += str[currCharIdx];
        }
    }

    return resultStr;
}

var orig2 = 'abc你好吗？';

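// sample emoji (U+1F600), used as an assumed input because the emoji literals in these test lines were lost
var emoji = '\uD83D\uDE00';
console.log('res: ' + substr_utf8_bytes(emoji, 0, 4)); // logs the emoji
console.log('res: ' + substr_utf8_bytes(emoji + emoji, 0, 4)); // logs the first emoji
console.log('res: ' + substr_utf8_bytes(emoji + emoji, 4, 4)); // logs the second emoji
console.log('res: ' + substr_utf8_bytes(orig2, 0, 2)); // logs: "ab"
console.log('res: ' + substr_utf8_bytes(orig2, 2, 1)); // logs: "c"
console.log('res: ' + substr_utf8_bytes(orig2, 3, 3)); // logs: "你"
console.log('res: ' + substr_utf8_bytes(orig2, 6, 6)); // logs: "好吗"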

(Note that "char" in the variable names should be something like "code unit" instead, but I got lazy.)
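
As a possible alternative, the same forward scan can be sketched by iterating over code points instead of code units, which removes the surrogate bookkeeping. This assumes an ES2015+ runtime (`for...of` on strings and `codePointAt`); the function name `substrUtf8BytesByCodePoint` is made up for this example.

```javascript
// Hypothetical variant: walks the string one code point at a time and keeps
// every character whose first UTF-8 byte falls inside the requested range.
function substrUtf8BytesByCodePoint(str, startInBytes, lengthInBytes) {
    var endInBytes = startInBytes + lengthInBytes;
    var bytePos = 0;
    var result = '';
    for (var ch of str) {                 // iterates code points, not code units
        if (bytePos >= endInBytes) break; // stop early; the rest is not scanned
        var cp = ch.codePointAt(0);
        var len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (bytePos >= startInBytes) result += ch;
        bytePos += len;
    }
    return result;
}

console.log(substrUtf8BytesByCodePoint('abc你好吗？', 6, 6)); // 好吗
```

When the range starts and ends on character boundaries this should return the same substring as the function above. Unlike that function, it does not throw when the range runs past the end of the string; it simply returns what is there.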

Wow. Thanks for improving! Since this is the more complete solution I updated my answer to reference you. — Kaii, Jul 29 '22 at 13:24
Also, i expect this solution to be way faster than my original code, because constantly escaping and unescaping each char in the loop is very inefficient, but I didn't know better at the time of writing. It was just a POC tho. — Kaii, Jul 29 '22 at 13:31

Houshang.Karami · score -1 · Answer 7 · answered Jun 26 '12 at 04:22

System.ArraySegment is useful, but you need to construct it with an input array, an offset, and a count.

Is that in javascript? Or just a C# library? – tofutim Jun 26 '12 at 09:21
