I have this piece of code, which is meant to split a string into an array of strings of at most CHUNK_SIZE bytes each (I'm doing this to paginate results). It works when every character is 1 byte, but when a multi-byte character (for example a 2-byte French character like é, or a 3-byte Chinese character) falls exactly on a split boundary, its bytes are divided between two chunks. I end up with an unreadable character at the end of one array element and another at the start of the next.
Is there a way to fix the code so that multi-byte characters are kept intact in the final result?
public static ArrayList<String> splitFile(String data) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    int CHUNK_SIZE = 400000; // 400,000 bytes, roughly 0.4 MB
    if (data.getBytes().length > CHUNK_SIZE) {
        byte[] buffer = new byte[CHUNK_SIZE];
        int start = 0, end = buffer.length;
        long remaining = data.getBytes().length;
        ByteArrayInputStream inputStream =
                new ByteArrayInputStream(data.getBytes());
        // Read fixed-size byte blocks; a block can end in the middle of a character
        while ((inputStream.read(buffer, start, end)) != -1) {
            ByteArrayOutputStream outputStream =
                    new ByteArrayOutputStream();
            outputStream.write(buffer, start, end);
            // Decoding a block that ends mid-character produces the unreadable characters
            messages.add(outputStream.toString("UTF-8"));
            remaining = remaining - end;
            // Shrink the last read to the number of bytes that are left
            if (remaining <= end) {
                end = (int) remaining;
            }
        }
        return messages;
    }
    // Small enough to fit in a single chunk
    messages.add(data);
    return messages;
}
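
For reference, one direction that might work is to stop cutting the byte array and instead walk the string one code point at a time, counting how many UTF-8 bytes each code point needs and starting a new chunk before the limit would be exceeded. The following is only a minimal sketch under that assumption: the class name Utf8Chunker and the methods utf8Length and splitByUtf8Bytes are hypothetical names made up for illustration, and the sketch assumes the output is always encoded as UTF-8.

```java
import java.util.ArrayList;
import java.util.List;

class Utf8Chunker {

    // Number of bytes a single code point occupies when encoded as UTF-8.
    // Unpaired surrogates are counted as 3 bytes, which can only overestimate
    // the size Java's encoder produces for them, so chunks never exceed the limit.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Splits data into substrings whose UTF-8 encoding is at most maxBytes long,
    // cutting only between code points so no character is torn apart.
    static List<String> splitByUtf8Bytes(String data, int maxBytes) {
        if (maxBytes < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        int chunkStart = 0; // char index where the current chunk begins
        int chunkBytes = 0; // UTF-8 bytes accumulated in the current chunk
        int i = 0;
        while (i < data.length()) {
            int codePoint = data.codePointAt(i);
            int codePointBytes = utf8Length(codePoint);
            if (chunkBytes + codePointBytes > maxBytes) {
                // The next character would not fit: close the chunk before it.
                chunks.add(data.substring(chunkStart, i));
                chunkStart = i;
                chunkBytes = 0;
            }
            chunkBytes += codePointBytes;
            i += Character.charCount(codePoint); // 2 chars for surrogate pairs
        }
        // Add the final chunk (or the whole string if it was empty).
        if (chunkStart < data.length() || chunks.isEmpty()) {
            chunks.add(data.substring(chunkStart));
        }
        return chunks;
    }
}
```

With something like this, splitFile could presumably be reduced to a call such as Utf8Chunker.splitByUtf8Bytes(data, 400000). Chunks may come out a few bytes shorter than the limit, since a character that does not fit is moved whole into the next chunk. Note that this keeps code points intact but does not try to keep combining sequences (a base letter followed by a separate accent mark) together.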