15

I am interacting with an API that accepts strings of at most 5 KB.

I want to take a string that may be larger than 5 KB and break it into chunks smaller than 5 KB.

I then intend to pass each chunk to the API endpoint, and perform further actions when all requests have finished, probably using something like:

await Promise.all([get_thing_from_api(string_1), get_thing_from_api(string_2), get_thing_from_api(string_3)])

I have read that a character in a string can take between 1 and 4 bytes.

For this reason, to calculate string length in bytes we can use:

// in Node, string is UTF-8    
Buffer.byteLength("here is some text"); 

// in Javascript  
new Blob(["here is some text"]).size

Source:
https://stackoverflow.com/a/56026151
https://stackoverflow.com/a/52254083
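To make the difference between character count and byte count concrete, here is a small sketch that compares `length` with the UTF-8 byte length for a few sample strings. It relies on `TextEncoder`, which exists in browsers and in current Node versions; the helper name `utf8ByteLength` is made up for this illustration.

```javascript
// Hypothetical helper: UTF-8 byte length of a string, without Buffer or Blob.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

for (const sample of ["abc", "ф", "€", "😀"]) {
  // .length counts UTF-16 code units, which is neither characters nor bytes
  console.log(sample, sample.length, utf8ByteLength(sample));
}
// abc 3 3
// ф 1 2
// € 1 3
// 😀 2 4
```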

My searches for "how to split strings into chunks of a certain size" return results about splitting a string into pieces of a particular character length, not byte length, e.g.:

var my_string = "1234 5 678905";

console.log(my_string.match(/.{1,2}/g));
// ["12", "34", " 5", " 6", "78", "90", "5"]

Source:
https://stackoverflow.com/a/7033662
https://stackoverflow.com/a/6259543
https://gist.github.com/hendriklammers/5231994

Question

Is there a way to split a string into strings of a particular byte length?

I could either:

  • assume that strings will only contain 1 byte per character
  • allow for the 'worst case scenario' that each character is 4 bytes

but would prefer a more accurate solution.

I would be interested to know of both Node and plain JavaScript solutions, if they exist.

EDIT

This approach to calculating byteLength might be helpful - by iterating over characters in a string, getting their character code and incrementing byteLength accordingly:

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

Source: https://stackoverflow.com/a/23329386

which led me to interesting experiments into the underlying data structures of Buffer:

var buf = Buffer.from('Hey! ф');
// <Buffer 48 65 79 21 20 d1 84>  
buf.length // 7
buf.toString().charCodeAt(0) // 72
buf.toString().charCodeAt(5) // 1092  
buf.toString().charCodeAt(6) // NaN    
buf[0] // 72
for (let i = 0; i < buf.length; i++) {
  console.log(buf[i]);
}
// 72 101 121 33 32 209 132 (one per line; a REPL then prints undefined for the loop itself)
buf.slice(0,5).toString() // 'Hey! '
buf.slice(0,6).toString() // 'Hey! �'
buf.slice(0,7).toString() // 'Hey! ф'

but as @trincot pointed out in the comments, what is the correct way to handle multibyte characters? And how could I ensure chunks are split on spaces (so as not to 'break apart' a word)?

More info on Buffer: https://nodejs.org/api/buffer.html#buffer_buffer
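On the multi-byte question, one option (a sketch of mine, not taken from any answer) relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`. If the byte right after the intended cut is a continuation byte, the cut can be moved backwards until it lands on the first byte of a character. The function name `safeCut` is hypothetical.

```javascript
// Hypothetical helper: largest cut index <= maxBytes that does not split a UTF-8 character.
function safeCut(bytes, maxBytes) {
  let cut = Math.min(maxBytes, bytes.length);
  // A cut is unsafe when the byte just after it is a continuation byte (10xxxxxx).
  while (cut > 0 && cut < bytes.length && (bytes[cut] & 0xc0) === 0x80) cut--;
  return cut;
}

const bytes = new TextEncoder().encode("Hey! ф"); // 48 65 79 21 20 d1 84
const decoder = new TextDecoder();
console.log(decoder.decode(bytes.slice(0, safeCut(bytes, 6)))); // "Hey! "
console.log(decoder.decode(bytes.slice(0, safeCut(bytes, 7)))); // "Hey! ф"
```

This only protects character boundaries; it does nothing to keep words intact, which needs a search for a space instead.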

EDIT

In case it helps anyone else understand the brilliant logic in the accepted answer, the snippet below is a heavily commented version I made so I could understand it better.

/**
 * Takes a string and returns an array of substrings that are smaller than maxBytes.  
 *
 * This is an overly commented version of the non-generator version of the accepted answer, 
 * in case it helps anyone understand its (brilliant) logic.  
 *
 * Both plain js and node variations are shown below - simply un/comment out your preference  
 * 
 * @param  {string} s - the string to be chunked
 * @param  {number} maxBytes - the maximum size of a chunk, in bytes
 * @return {string[]} - an array of strings less than maxBytes (except in extreme edge cases)
 */
function chunk(s, maxBytes) {
  // for plain js  
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder("utf-8").encode(s);
  // for node
  // let buf = Buffer.from(s);
  const result = [];
  var counter = 0;
  while (buf.length) {
    console.log("=============== BEG LOOP " + counter + " ===============");
    console.log("result is now:");
    console.log(result);
    console.log("buf is now:");
    // for plain js
    console.log(decoder.decode(buf));
    // for node  
    // console.log(buf.toString());
    /* get index of the last space character in the first chunk, 
    searching backwards from the maxBytes + 1 index */
    let i = buf.lastIndexOf(32, maxBytes + 1);
    console.log("i is: " + i);
    /* if no space is found in the first chunk,
    get the index of the first space character at or after index maxBytes,
    searching forwards - in edge cases where the characters
    between spaces exceed maxBytes, eg chunk("123456789x 1", 9),
    the chunk will exceed maxBytes */
    if (i < 0) i = buf.indexOf(32, maxBytes);
    console.log("at first condition, i is: " + i);
    /* if there's no space at all, take the whole string,
    again an edge case like chunk("123456789x", 9) will exceed maxBytes*/
    if (i < 0) i = buf.length;
    console.log("at second condition, i is: " + i);
    // this is a safe cut-off point; never half-way a multi-byte
    // because the index is always the index of a space    
    console.log("pushing buf.slice from 0 to " + i + " into result array");
    // for plain js
    result.push(decoder.decode(buf.slice(0, i)));
    // for node
    // result.push(buf.slice(0, i).toString());
    console.log("buf.slicing with value: " + (i + 1));
    // slice the string from the index + 1 forwards  
    // it won't erroneously slice out a value after i, because i is a space  
    buf = buf.slice(i + 1); // skip space (if any)
    console.log("=============== END LOOP " + counter + " ===============");
    counter++;
  }
  return result;
}

console.log(chunk("Hey there! € 100 to pay", 12));
user1063287
  • 2
    Is a split allowed to happen in the middle of a multibyte-character? – trincot Jul 17 '19 at 05:20
  • good question, it is for translating text to speech, generating either a single audio file (if text is less than 5kb) or multiple audio files (if text is more than 5kb), so i suppose the condition would have to say something like "break chunks at instances of a space character". – user1063287 Jul 17 '19 at 05:27
  • 1
    I love the layout of your question, and the Edits! – Ali Akram Dec 20 '20 at 03:57
  • 1
    There's [iter-ops](https://github.com/vitaly-t/iter-ops) module, which has flexible `split` and `page` operators for this. – vitaly-t Nov 29 '21 at 21:21

3 Answers

17

Using Buffer does indeed seem to be the right direction. Given that:

  • Buffer prototype has indexOf and lastIndexOf methods, and
  • 32 is the ASCII code of a space, and
  • 32 can never occur as part of a multi-byte character since all the bytes that make up a multi-byte sequence always have the most significant bit set.

... you can proceed as follows:

function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}

console.log(chunk("Hey there! € 100 to pay", 12)); 
// -> [ 'Hey there!', '€ 100 to', 'pay' ]

You can consider extending this to also look for TAB, LF, or CR as split-characters. If so, and your input text can have CRLF sequences, you would need to detect those as well to avoid getting orphaned CR or LF characters in the chunks.
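A rough sketch of that extension, assuming that space, TAB, LF and CR should all count as split points: instead of `lastIndexOf(32, ...)`, scan backwards for the nearest byte from a small set. All four are single-byte ASCII values, so they cannot occur inside a multi-byte sequence either. The helper name `lastWhitespaceIndex` is invented here, and CRLF handling is deliberately left out.

```javascript
// Byte values treated as split points: TAB, LF, CR, space.
const SPLIT_BYTES = new Set([9, 10, 13, 32]);

// Hypothetical helper: index of the last split byte at or before fromIndex, or -1.
function lastWhitespaceIndex(buf, fromIndex) {
  for (let i = Math.min(fromIndex, buf.length - 1); i >= 0; i--) {
    if (SPLIT_BYTES.has(buf[i])) return i;
  }
  return -1;
}

const buf = new TextEncoder().encode("one\ttwo\nthree four");
console.log(lastWhitespaceIndex(buf, 9)); // 7, the LF after "two"
console.log(lastWhitespaceIndex(buf, 2)); // -1, no split byte within "one"
```

The same function works on a Node `Buffer`, since it only reads bytes by index.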

You can turn the above function into a generator, so that you control when you want to start the processing for getting the next chunk:

function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

Browsers

Buffer is specific to Node. Browsers however implement TextEncoder and TextDecoder, which leads to similar code:

function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder("utf-8").encode(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
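Because a chunk can exceed `maxBytes` when no space is available to cut at, it may be worth checking the output before sending it on. A minimal sketch, with the hypothetical helper name `oversizedChunks`:

```javascript
// Hypothetical helper: return the chunks whose UTF-8 size exceeds maxBytes.
function oversizedChunks(chunks, maxBytes) {
  const encoder = new TextEncoder();
  return chunks.filter((c) => encoder.encode(c).length > maxBytes);
}

console.log(oversizedChunks(["Hey there!", "€ 100 to", "pay"], 12)); // []
console.log(oversizedChunks(["123456789x", "1"], 9)); // [ '123456789x' ]
```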
trincot
  • Brilliant, I was playing around with `lastIndexOf(32)` but couldn't figure out how to dynamically create chunks that ended and began at that last instance of a space and push them to an array. Just curious, why is the `If no space found, try forward search` line required? Doesn't the previous line achieve the same goal, ie it gets the last instance of a space character (even if there is only one instance, close to the start of the string)? And doesn't the next line, `If there's no space at all, take all`, have the same condition, ie `if (i < 0)`, and overwrite the value assigned to `i`? – user1063287 Jul 17 '19 at 09:02
  • 1
    The edit just removed a step by replacing a slice with a second argument to `lastIndexOf`, but it comes down to the same logic. Let's say maxBytes is 1000. So the `lastIndexOf` part looks for the last space that occurs in the first 1001 bytes. If that returns -1, that means there is no way to make the chunk. The second best is to look for the first space *after* position 1000. That would result in a slightly bigger chunk, but there just is no other way. When that also fails (-1), then we cannot do anything other than take all characters as a single chunk. – trincot Jul 17 '19 at 10:02
  • 1
    Ah, i see, thank you so much for the clarification, just to test an extreme edge case, i did a test with a string of 14,002 characters where the first 5000 characters had no space between them, and `maxBytes` was 5000, and the behaviour was as you described. Thanks again. – user1063287 Jul 17 '19 at 11:38
1

A possible solution is to count the bytes of each character:

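function charByteCounter(char){
    const cp = char.codePointAt(0)  // full code point, so astral characters are not cut in half
    // UTF-8 needs 1 to 4 bytes depending on the code point range
    if (cp <= 0x7f) return 1
    if (cp <= 0x7ff) return 2
    if (cp <= 0xffff) return 3
    return 4
}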

function * chunk(string, maxBytes) {
    let byteCounter = 0
    let buildString = ''
    for(const char of string){
        const bytes = charByteCounter(char)
        if(byteCounter + bytes > maxBytes){ // check if the current bytes + this char bytes is greater than maxBytes
            yield buildString // string with less or equal bytes number to maxBytes
            buildString = char
            byteCounter = bytes
            continue
        }
        buildString += char
        byteCounter += bytes
    }

    yield buildString
}

for (const s of chunk("Hey! , nice to meet you!", 12))
    console.log(s);

Sources:

Ido
  • This is a better solution for me. The accepted answer ended up with 3 chunks instead of 2. I suppose it was looking for a space character which isn't in my string. – iocoker Dec 12 '22 at 17:27
-1

Small addition to @trincot's answer:

If the string you are splitting contains a space (" "), the returned array always has at least two items, even when the full string would fit into maxBytes (and so should be returned as a single item).

To fix this I added a check in the first line of the while loop:

export function chunkText (text: string, maxBytes: number): string[] {
  let buf = Buffer.from(text)
  const result = []
  while (buf.length) {
    let i = buf.length > maxBytes ? buf.lastIndexOf(32, maxBytes + 1) : buf.length
    // If no space found, try forward search
    if (i < 0) i = buf.indexOf(32, maxBytes)
    // If there's no space at all, take the whole string
    if (i < 0) i = buf.length
    // This is a safe cut-off point; never half-way a multi-byte
    result.push(buf.slice(0, i).toString())
    buf = buf.slice(i+1) // Skip space (if any)
  }
  return result
}
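For comparison, here is a hedged variant that runs in browsers as well as Node by using `TextEncoder`/`TextDecoder`. It takes the remainder whole whenever it already fits (`buf.length <= maxBytes`), and it searches backwards from index `maxBytes` rather than `maxBytes + 1`, on the assumption that a chunk must never exceed `maxBytes` when a suitable space exists (a cut at a space at index `maxBytes + 1` would yield `maxBytes + 1` bytes). The name `chunkWithinLimit` is made up for this sketch.

```javascript
// Sketch only: split on spaces so that chunks stay within maxBytes where a space allows it.
function chunkWithinLimit(text, maxBytes) {
  const decoder = new TextDecoder();
  let buf = new TextEncoder().encode(text);
  const result = [];
  while (buf.length) {
    let i;
    if (buf.length <= maxBytes) {
      i = buf.length; // the remainder fits: take it whole
    } else {
      i = buf.lastIndexOf(32, maxBytes); // last space giving a chunk of <= maxBytes bytes
      if (i < 0) i = buf.indexOf(32, maxBytes); // no such space: accept an oversized chunk
      if (i < 0) i = buf.length; // no space at all: take everything
    }
    result.push(decoder.decode(buf.subarray(0, i)));
    buf = buf.subarray(i + 1); // skip the space (if any)
  }
  return result;
}

console.log(chunkWithinLimit("a b", 100)); // [ 'a b' ]
console.log(chunkWithinLimit("Hey there! € 100 to pay", 12)); // [ 'Hey there!', '€ 100 to', 'pay' ]
console.log(chunkWithinLimit("ab cde fg", 5)); // [ 'ab', 'cde', 'fg' ]
```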