Splitting a string containing multi-byte characters into an array of strings

Question

I have this piece of code, which is intended to split a string into an array of strings using CHUNK_SIZE as the size of each piece, in bytes (I'm doing this to paginate results). This works when every character is 1 byte, but when a multi-byte character (for example a 2-byte French character like é, or a 3- or 4-byte Chinese character) falls precisely on the split location, I end up with unreadable characters at the end of the first array element and at the start of the second one.

Is there a way to fix the code to account for multibyte characters so they are maintained in the final result?

public static ArrayList<String> splitFile(String data) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    int CHUNK_SIZE = 400000; // 400,000 bytes (about 0.4 MB)

    if (data.getBytes().length > CHUNK_SIZE) {
        byte[] buffer = new byte[CHUNK_SIZE];
        int start = 0, end = buffer.length;
        long remaining = data.getBytes().length;
        ByteArrayInputStream inputStream =
                new ByteArrayInputStream(data.getBytes());

        while ((inputStream.read(buffer, start, end)) != -1) {
            ByteArrayOutputStream outputStream =
                    new ByteArrayOutputStream();
            outputStream.write(buffer, start, end);
            messages.add(outputStream.toString("UTF-8"));
            remaining = remaining - end;

            if (remaining <= end) {
                end = (int) remaining;
            }
        }
        return messages;
    }

    messages.add(data);
    return messages;
}
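
To make the failure concrete, here is a minimal sketch that is not part of the original question (the class name ByteSplitDemo and the helper decode are made up for illustration). It cuts the UTF-8 bytes of "asdfgéwert" in the middle of the two-byte é and decodes each half on its own:

```java
import java.nio.charset.StandardCharsets;

class ByteSplitDemo {
    /** Decodes bytes [from, to) as UTF-8; malformed input is replaced with U+FFFD. */
    static String decode(byte[] bytes, int from, int to) {
        return new String(bytes, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; in UTF-8 it is the two bytes 0xC3 0xA9 at offsets 5 and 6.
        byte[] bytes = "asdfg\u00e9wert".getBytes(StandardCharsets.UTF_8);

        // Cutting at offset 6 separates the two bytes of 'é'.
        String first = decode(bytes, 0, 6);             // "asdfg" + U+FFFD
        String second = decode(bytes, 6, bytes.length); // U+FFFD + "wert"

        System.out.println(first);
        System.out.println(second);
    }
}
```

Each half decodes to a string containing the replacement character U+FFFD, which is the "unreadable character" seen at the chunk edges.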

Bohemian · Answer 1 · 2021-01-07T01:05:10.710

2

You want to:

count characters not bytes
use regex for the chunk size and word boundary sensitivity
write less code

ergo,

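// requires: import java.util.Arrays; import java.util.List;
private static final int CHUNK_SIZE = 400000;

// Arrays.asList returns a List<String>, so the return type cannot be ArrayList<String>.
public static List<String> splitFile(String data) {
    return Arrays.asList(data.split("(?s)(?<=\\G.{1," + CHUNK_SIZE + "}\\b) +"));
}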

Breaking down the regex:

(?s) means “dot should match new lines”
\G means “the end of the last match”, and is initialized to start of input
\b means “word boundary”
(?<=\G.{1,400000}\b) means “preceded by the end of the last match then up to 400000 characters then a word boundary”

Not sure if you really need a List returned or not. You could just return the string array from the split.
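
If word boundaries do not matter, the "count characters, not bytes" idea can also be sketched without a regex. The class below is a hypothetical illustration, not part of the answer above (the names CodePointChunker and maxCodePoints are assumptions). It counts Unicode code points, so a surrogate pair is never cut in half, but the byte size of each chunk will vary with the content:

```java
import java.util.ArrayList;
import java.util.List;

class CodePointChunker {
    /** Splits data into pieces of at most maxCodePoints Unicode code points each. */
    static List<String> split(String data, int maxCodePoints) {
        if (maxCodePoints <= 0) {
            throw new IllegalArgumentException("maxCodePoints must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < data.length()) {
            int end = start;
            // Advance one code point at a time; a supplementary character
            // occupies two chars (a surrogate pair) and is stepped over whole.
            for (int n = 0; n < maxCodePoints && end < data.length(); n++) {
                end += Character.charCount(data.codePointAt(end));
            }
            chunks.add(data.substring(start, end));
            start = end;
        }
        return chunks;
    }
}
```

For example, CodePointChunker.split("asdfgéwert", 6) gives the two pieces "asdfgé" and "wert".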

edited Jan 07 '21 at 01:05

answered Dec 29 '20 at 02:36


Thanks, but it doesn't seem like the split is working. I reduced the CHUNK_SIZE to 6 for easier visibility while testing and fed it a string with a French character at the 6th position: the input is asdfgéwert but my output was [asdfgéwert]. – stfudonny Jan 07 '21 at 00:47
@stfudonny there was a bug (I added `" +"` to the end of the regex), but this regex is designed to split on *spaces*, ie to leave words whole. If there's no space within the max length, there will be no split. I tried this with accented chars and it works - see [live demo](https://ideone.com/Qnnvfw). See if it works for you with input that includes spaces - ie `"asdfgé wert"` and a length of 6. – Bohemian Jan 07 '21 at 01:07

Joop Eggen · Accepted Answer · 2021-01-07T05:22:41.583

public static List<String> splitFile(String data) throws IOException {
    List<String> messages = new ArrayList<>();
    final int CHUNK_SIZE = 400_000; // 400,000 bytes (about 0.4 MB)

    byte[] dataBytes = data.getBytes(StandardCharsets.UTF_8);
    byte[] buffer = new byte[CHUNK_SIZE];
    int start = 0;
    final int end = CHUNK_SIZE;
    ByteArrayInputStream inputStream = new ByteArrayInputStream(dataBytes);

    for (; ; ) {
        int read = inputStream.read(buffer, start, end - start);
        if (read == -1) {
            if (start != 0) {
                messages.add(new String(buffer, 0, start,
                        StandardCharsets.UTF_8));
            }
            break;
        }
        // Check for half read multi-byte sequences:
        int fullEnd = start + read;
        while (fullEnd > 0) {
            byte b = buffer[fullEnd - 1];
            if (b >= 0) { // ASCII.
                break;
            }
            if ((b & 0xC0) == 0xC0) { // Start byte of sequence.
                --fullEnd;
                break;
            }
            --fullEnd;
        }
        messages.add(new String(buffer, 0, fullEnd, StandardCharsets.UTF_8));
        start += read - fullEnd;
        if (start > 0) { // Copy the bytes after fullEnd to the start.
            System.arraycopy(buffer, fullEnd, buffer, 0, start);
            //               src     srcI     dest    destI len
        }
    }
    return messages;
}

I have kept the ByteArrayInputStream, as most often one reads from InputStream, instead of having all bytes in memory.

Then the chunk buffer is filled starting from start rather than from 0, as some bytes from the previous chunk read might still be lingering at the front of the buffer.

Reading gives the number of bytes read or -1.

At the end an ASCII char is okay, otherwise I position the end at the beginning of a multibyte sequence. Maybe that sequence is completely read, maybe not. Here I just keep it for the next chunk being read.

This code has not been run through a compiler.

A List of messages is not memory friendly either.

BTW, one would have a similar problem on char[]: sometimes a single Unicode code point (symbol) takes two UTF-16 chars.
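
When all bytes are already in memory, the same boundary idea can be sketched without a stream. This is a simplified variant, not the code above: the class name Utf8Chunker and the parameter maxBytes are made up for illustration, and it assumes maxBytes is at least 4 (the longest UTF-8 sequence). Instead of always pushing a trailing multi-byte character to the next chunk, it only backs up when the cut would land on a continuation byte:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8Chunker {
    /** Splits data into strings whose UTF-8 encodings are at most maxBytes long. */
    static List<String> split(String data, int maxBytes) {
        if (maxBytes < 4) { // assumption: room for the longest UTF-8 sequence
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < bytes.length) {
            int end = Math.min(start + maxBytes, bytes.length);
            // bytes[end] would be the first byte of the next chunk. If it is a
            // continuation byte (10xxxxxx), the cut is inside a character, so
            // move the cut back until it sits in front of a lead or ASCII byte.
            while (end < bytes.length && (bytes[end] & 0xC0) == 0x80) {
                end--;
            }
            chunks.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
            start = end;
        }
        return chunks;
    }
}
```

With the input asdfgéwert and maxBytes set to 6, this yields "asdfg" (5 bytes) and "éwert" (6 bytes), and concatenating the chunks reproduces the input.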

Thanks for your help, but I'm getting ArrayIndexOutOfBounds at ByteInputStream and I cannot figure out why — stfudonny, Jan 07 '21 at 00:38
I corrected; the new `start` value was wrong, and on loop break `start` bytes could remain. — Joop Eggen, Jan 07 '21 at 05:24
Thank you, this seems to work perfectly. I am trying to make sense of the details though, what does this conditional statement mean? if ((b & 0xC0) == 0xC0) { // Start byte of sequence. — stfudonny, Jan 31 '21 at 18:45
The multi-byte sequences start with a byte with the bits 11xxxxxx. (The continuation bytes then match bits 10xxxxxx.) 0xC0 == 0b1100_0000. When `b` has bits ABxxxxxx, then `b & 0xC0` has bits AB000000. And finally `== 0xC0` ensures A==1 and B==1. _(Bit operations.)_ — Joop Eggen, Jan 31 '21 at 20:17
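
The bit tests from the comment above can be written out as three small predicates. This is only an illustrative sketch; the class and method names are made up:

```java
class Utf8ByteKind {
    /** 0xxxxxxx: a single-byte (ASCII) character. */
    static boolean isAscii(byte b) {
        return (b & 0x80) == 0;
    }

    /** 11xxxxxx: first byte of a multi-byte sequence. */
    static boolean isLead(byte b) {
        return (b & 0xC0) == 0xC0;
    }

    /** 10xxxxxx: second or later byte of a multi-byte sequence. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }
}
```

For é, encoded as the bytes 0xC3 0xA9, isLead is true for 0xC3 and isContinuation is true for 0xA9.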

score 1 · Answer 3 · 2020-12-31T08:39:12.193

Since you are doing this to paginate results, it may be useful to split the text not by characters but by words. You can iterate over the character indices of the string and, for each word, check whether at least half of it fits on the page; if not, start a new page.

Example with a limited line size on one page. It works the same way with a limited page size in a multi-page document:

String text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, " +
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. " +
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris " +
        "nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in " +
        "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla " +
        "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in " +
        "culpa qui officia deserunt mollit anim id est laborum.";

int length = 55;

ArrayList<String> lines = new ArrayList<>();

int lastWord = 0;
int lastLine = 0;
for (int i = 0; i < text.length(); i++) {
    if (text.charAt(i) == ' ') {
        if (i - lastLine + (i - lastWord) / 2 > length) {
            lines.add(text.substring(lastLine, i));
            lastLine = i + 1;
        }
        lastWord = i + 1;
    }
}
lines.add(text.substring(lastLine));

// output line by line
lines.forEach(System.out::println);

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur
sint occaecat cupidatat non proident, sunt in culpa qui
officia deserunt mollit anim id est laborum.

See also: How to split a string after a certain length? But it should be divided after word completion

I only need to split by chars since I think I'll mostly be splitting XML objects and not human readable text. But thanks for the great info! — stfudonny, Jan 07 '21 at 00:34
