How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?

8 Answers
Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:
public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}
This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs surrogate pairs as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.
I haven't done a lot of testing on that code, but here are some preliminary tests:
private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}
public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);
}
Updated: modified the code example; it now handles surrogate pairs.

- UTF-8 can encode any UCS2 character in 3 bytes or less. Check that page you reference. However, if you want to comply with UCS4 or UTF16 (which can both reference the entire charset), you'll need to allow for up to 6-byte characters in UTF8. – billjamesdev Sep 23 '08 at 23:11
- Bill: see the CESU-8 discussion on the wikipedia page. My understanding is UTF-8 is supposed to encode surrogate pairs as a single 4-byte sequence, not two 3-byte sequences. – Matt Quail Sep 24 '08 at 00:00
- It's not 2 three-byte, it's up to 1 6-byte sequence to store UCS4, which is a full 31-bit character, not 2 16-bit "pairs" (that's UTF16). A 6-byte seq = 1111110C 10CCCCCC 10CCCCCC 10CCCCCC 10CCCCCC 10CCCCCC where the C's are data bits. Right now, only enough chars are in use to need 4 bytes. – billjamesdev Sep 24 '08 at 05:24
- But 8 years ago, more than 16 bits wasn't even necessary. Expect to see 5-byte chars in the next decade as more dialects and "Klingon"-type language planes are added. – billjamesdev Sep 24 '08 at 05:26
- Bill: you are correct, my code does not handle code points above U+10FFFF -- which is where more than 4 UTF-8 bytes are required. But Java can't encode characters past U+10FFFF anyway. Each `char` in Java is a 16 bit codepoint between U+0000 and U+FFFF. Surrogate pairs give you up to U+10FFFF. – Matt Quail Sep 24 '08 at 11:52
- Well, then, it would seem my solution is in excess. Didn't know that about Java's character (my I18n work was done for EQ in C++). Nice chat. :) – billjamesdev Sep 24 '08 at 14:03
- That won't work for graphemes. It's just as bad to truncate a partial grapheme as it is to truncate a partial character. – tchrist Apr 24 '11 at 19:54
- @tchrist well, it actually isn't quite as bad, because software won't choke on trying to decode them – Chad Jun 15 '12 at 01:07
- Does this really need to be O(N)? Why not truncate the bytes, then look at the last 4 to figure out if you cut off in the middle of a unicode character, so that it's O(1)? – user2615861 Dec 28 '16 at 23:30
You should use a CharsetEncoder; the simple getBytes() + copy-as-many-bytes-as-fit approach can cut UTF-8 characters in half.
Something like this:
public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());
    CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
    utf8Enc.encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length()
            + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}
- This worked great for me -- probably less efficient, but much harder to get wrong, and it works for any character set. Works nicely with a quick `new String(output, 0, output.length - returnValue, CHARSET)` – ojrac Jun 27 '11 at 17:53
- @sigget's solution is similar and in addition returns the actual truncated string, instead of just the length – Peter Davis Sep 23 '16 at 18:17
- If this was, let's say, for Oracle, shouldn't the `UTF-8` be replaced with whatever encoding the target column is defined with? – Jaroslav Záruba Jun 18 '18 at 07:08
Here's what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, etc. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the with checks added for null and for avoiding decoding when the string is fewer bytes than maxBytes.
/**
 * Truncates a string to the number of characters that fit in X bytes avoiding multi byte characters being cut in
 * half at the cut off point. Also handles surrogate pairs where 2 characters in the string is actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}
- `CharBuffer.allocate(maxBytes)` allocates too much. Could it be `CharBuffer.allocate(s.length())`? – Peter Davis Sep 23 '16 at 18:14
UTF-8 encoding has a neat trait that lets you see where in a byte sequence you are.
Check the stream at the character limit you want:
- If its high bit is 0, it's a single-byte char; just replace it with 0 and you're fine.
- If its high bit is 1 and so is the next bit, then you're at the start of a multi-byte char, so just set that byte to 0 and you're good.
- If the high bit is 1 but the next bit is 0, then you're in the middle of a character; travel back along the buffer until you hit a byte that has 2 or more 1s in the high bits, and replace that byte with 0.

Example: if your stream is 31 33 31 C3 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, as that would put the 0 after C3, which is the start of a multi-byte char.
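The rules above can be sketched as a small helper that returns a safe cut length instead of writing a 0 terminator (a Java adaptation, assuming the input bytes are valid UTF-8; safeCut is my name, not the answer's):

```java
import java.nio.charset.StandardCharsets;

public class SafeCut {
    // Largest cut point <= limit that doesn't split a multi-byte sequence:
    // back up while the first dropped byte is a continuation byte (10xxxxxx).
    static int safeCut(byte[] utf8, int limit) {
        int cut = Math.min(limit, utf8.length);
        while (cut > 0 && cut < utf8.length && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    public static void main(String[] args) {
        byte[] stream = "131\u00e323".getBytes(StandardCharsets.UTF_8); // 31 33 31 C3 A3 32 33
        System.out.println(safeCut(stream, 4)); // prints 3: cutting at 4 would split C3 A3
    }
}
```

Because UTF-8 continuation bytes always match 10xxxxxx, at most 3 bytes need to be inspected, which is the O(1) behaviour discussed in the comments on the accepted answer.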

- http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8 explains the modified UTF-8 encoding used by Java and demonstrates why this answer is correct. – Alexander Sep 23 '08 at 11:08
- BTW, this solution (the one by Bill James) is much more efficient than the currently accepted answer by @Matt Quail, because the former requires you to test 3 bytes at the most, whereas the latter requires you to test all characters in the text. – Alexander Sep 23 '08 at 17:32
- Alexander: the former requires you to *first convert the string to UTF8*, which requires iterating over all the characters in the text. – Matt Quail Sep 24 '08 at 00:02
- True, but the question does state "once it is UTF-8 encoded". Presumably that price has been paid. – billjamesdev Sep 24 '08 at 14:05
- @Alexander: That's because they screwed up. That's just trying to paper over the blunder. Surrogate pairs **HAVE NO BUSINESS IN UTF-8!** – tchrist Apr 24 '11 at 19:57
- There's a special case that I think should be considered: we might actually be at the last byte of a multi-byte character (I guess we would have to look at the next byte to find out whether this is the case). In that case we should not go back (and thereby trim 1 character too many), but just stay where we are. – chris Mar 02 '23 at 14:04
- @chris that case would be solved by the 2nd rule above when you process the next byte... you still need to do so in order to place the termination byte (0). – billjamesdev Mar 02 '23 at 17:23
You can use:
new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");

- Although your solution looked the best, this code gives me a StringIndexOutOfBoundsException: String index out of range: 300: String str = "kt on ivp (day 3) - part 2 - 19 haziran 2018 sal_ 11.36.21.mp4."; System.out.println("Len is " + str.getBytes(StandardCharsets.UTF_8.name()).length); String finalTitle = new String(str.getBytes(StandardCharsets.UTF_8.name()), 0, Constants.MAX_TITLE_LENGTH, StandardCharsets.UTF_8.name()); Constants.MAX_TITLE_LENGTH is 300. @Suresh Gupta Do you know why? – Investigator Nov 14 '18 at 13:43
- First, check your string length: if it is less than your max limit, it will throw an exception. Make sure your string length is more than your max limit before truncating. – Suresh Gupta Nov 15 '18 at 07:29
- The original question was literally how to perform that check optimally. – billjamesdev Mar 04 '23 at 17:45
You can calculate the number of bytes without doing any conversion.
foreach character in the Java string
    if 0 <= character <= 0x7f
        count += 1
    else if 0x80 <= character <= 0x7ff
        count += 2
    else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
        count += 3
    else if 0xdc00 <= character <= 0xffff
        count += 3
    else { // surrogate, a bit more complicated
        count += 4
        skip one extra character in the input stream
    }
You would have to detect surrogate pairs (U+D800–U+DBFF and U+DC00–U+DFFF) and count 4 bytes for each valid surrogate pair. If you get the first value in the first range and the second in the second range, it's all OK: skip them and add 4. But if not, then it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.
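The pseudocode above can be sketched in Java using code points, which does the surrogate pairing for you. The unpaired-surrogate branch assumes Java's UTF-8 encoder substitutes a 1-byte '?' for a lone surrogate, which is its default replacement behaviour:

```java
public class Utf8Length {
    // Counts UTF-8 bytes per the ranges above; codePointAt pairs surrogates.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp <= 0x7F) {
                count += 1;
            } else if (cp <= 0x7FF) {
                count += 2;
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                count += 1; // unpaired surrogate: getBytes() emits '?'
            } else if (cp <= 0xFFFF) {
                count += 3;
            } else {
                count += 4; // valid surrogate pair -> one 4-byte sequence
            }
            i += Character.charCount(cp); // 2 for astral code points, else 1
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("\uD834\uDD1E")); // prints 4
    }
}
```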

Based on billjamesdev's answer, I've come up with the following method which, as far as I can tell, is the simplest one that still works OK with surrogate pairs:
public static String utf8ByteTrim(String s, int trimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    if ((bytes[trimSize - 1] & 0x80) != 0) { // inside a multibyte sequence
        while ((bytes[trimSize - 1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--;
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}
Some testing:
String test = "Aæ尝试";
IntStream.range(1, 16).forEachOrdered(i ->
    System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);
---
Size 1: A
Size 2: A
Size 3: A
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ
Size 8: Aæ
Size 9: Aæ
Size 10: Aæ
Size 11: Aæ尝
Size 12: Aæ尝
Size 13: Aæ尝试
Size 14: Aæ尝试
Size 15: Aæ尝试

- There are cases when this code removes one character too many. Consider for example the String "木" (consisting of a single 3-byte character). With trimSize 3 I would expect to obtain the same String as the result of utf8ByteTrim. However, the actual result is "". The problem occurs when we are on the last byte of a multi-byte char. – chris Mar 02 '23 at 14:02
- The original code has two bugs. First, it breaches the bound of the array. Second, as mentioned in the prior comment, it truncates even if the entire last character fits. So an initial check is needed to be sure we don't scan beyond the bounds of the array, and a test must be done to see if the last character fits. – Brent K. May 19 '23 at 18:45
- @BrentK. while I appreciate the fixes, my intention was to come up with a method as simple as possible -- I mentioned this in the first paragraph. Your corrections introduce a lot of checks and thus complexity into the method, turning it into the longest of all the answers, which goes against my intention. I rejected your edit, but feel free to remove your upvote if you want and/or post a new answer with your code (I'd appreciate a mention if you did so, just like I did, but I won't mind if you don't include one). Thanks anyway! – walen May 19 '23 at 20:40
Scanning from the tail end of the string is far more efficient than scanning from the beginning, especially on very long strings. So walen was on the right path; unfortunately, that answer does not provide the correct truncation.
If you would like a solution that scans backwards over only a few characters, this is the best option.
Using the data in billjamesdev's answer, we can effectively scan backwards and correctly get the truncation on a character boundary.
public static String utf8ByteTrim(String s, int requestedTrimSize) {
    final byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    int maxTrimSize = Integer.min(requestedTrimSize, bytes.length);
    int trimSize = maxTrimSize;
    if ((bytes[trimSize - 1] & 0x80) != 0) { // inside a multibyte sequence
        while ((bytes[trimSize - 1] & 0x40) == 0) { // 2nd, 3rd, 4th bytes
            trimSize--;
        }
        trimSize--; // Get to the start of the UTF-8
        // Now see if that final UTF-8 character fits.
        // Assume the UTF-8 starts with binary 110xxxxx and is 2 bytes
        int numBytes = 2;
        if ((bytes[trimSize] & 0xF0) == 0xE0) {
            // If the UTF-8 starts with binary 1110xxxx it is 3 bytes
            numBytes = 3;
        } else if ((bytes[trimSize] & 0xF8) == 0xF0) {
            // If the UTF-8 starts with binary 11110xxx it is 4 bytes
            numBytes = 4;
        }
        if ((trimSize + numBytes) == maxTrimSize) {
            // The entire last UTF-8 character fits
            trimSize = maxTrimSize;
        }
    }
    return new String(bytes, 0, trimSize, StandardCharsets.UTF_8);
}
There is only one while loop, which will execute at most 3 iterations as it walks backward. Then a few if statements determine where to truncate.
Some testing:
String test = "Aæ尝试"; // Sizes: (1,2,4,3,3) = 13 bytes
IntStream.range(1, 16).forEachOrdered(i ->
    System.out.println("Size " + i + ": " + utf8ByteTrim(test, i))
);
---
Size 1: A
Size 2: A
Size 3: Aæ
Size 4: Aæ
Size 5: Aæ
Size 6: Aæ
Size 7: Aæ
Size 8: Aæ
Size 9: Aæ
Size 10: Aæ尝
Size 11: Aæ尝
Size 12: Aæ尝
Size 13: Aæ尝试
Size 14: Aæ尝试
Size 15: Aæ尝试
