
I'm trying to find a substring method, or a characterAt method, that works on strings containing UTF-8 encoded text in Java.

Internally, Java works with UTF-16, which means a String is composed of chars of 2 bytes each. A single character can take up to 4 bytes in UTF-8, and a character outside the Basic Multilingual Plane (anything above U+FFFF) does not fit in one char: Java stores it as a surrogate pair of two chars.

For example: the character U+20000 (UTF-8 hex: F0 A0 80 80) is stored internally in Java as a String with two chars (UTF-16 hex: D840 and DC00).

When you have a String containing such a 4-byte UTF-8 character and call length(), the answer is 2. When you call substring(0, 1), you get the first half of the character.

Some code to illustrate this:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;

    // U+20000, encoded as the 4 UTF-8 bytes F0 A0 80 80
    ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte) 0xF0, (byte) 0xA0, (byte) 0x80, (byte) 0x80});
    CharBuffer data = Charset.forName("UTF-8").decode(inputBuffer);
    String string_test = data.toString();
    int length = string_test.length();                    // 2, not 1
    String first_half = string_test.substring(0, 1);      // "\uD840" - an unpaired high surrogate
    String second_half = string_test.substring(1, 2);     // "\uDC00" - an unpaired low surrogate
    String full_character = string_test.substring(0, 2);  // the complete character U+20000

All this, even if unexpected, is not a bug, since Java works in UTF-16. Inherent UTF-8 support would be nice, but it's not there.

Does Java have any class in the default library, or does a class exist somewhere, that provides UTF-8 support? As in:

  • utf8string.length() - returns 1 if there is one 4-byte character in there
  • utf8string.getCharacterAt(0) - returns the first character, not the first half of it
  • utf8string.substring(0, 1) - returns the first character, not the first half of it

Or, what is the commonly used solution for this? Convert every character that needs a surrogate pair in UTF-16 (i.e. everything outside the Basic Multilingual Plane) to a default replacement character when reading UTF-8 files, and as a result lose all information on those code points? That is not necessarily an issue in my specific implementation, so if there is a common way of doing this, I'd be interested.
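
To make that concrete, something like the following rough sketch is what I mean (replaceSupplementary is just a name I made up):

    // Replace each surrogate pair (i.e. each character outside the BMP) with U+FFFD.
    static String replaceSupplementary(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                sb.append('\uFFFD'); // one replacement character for the whole pair
                i++;                 // skip the low surrogate
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }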

Wouter

2 Answers


Does Java have any class in the default library, or does a class exist somewhere, that provides UTF-8 support?

You're not after UTF-8 support really. You're after Unicode code points (plain 32-bit integers), rather than UTF-16 code units. And yes, Java provides support for this, but it's not hugely easy to work with.

For example, to get a particular code point, use String.codePointAt - bearing in mind that the index you provide is in terms of UTF-16 code units, not code points.
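
For instance, a quick sketch using the U+20000 character from the question:

    String s = "\uD840\uDC00";                    // U+20000 as a surrogate pair
    int cp = s.codePointAt(0);                    // 0x20000 - the whole code point, not just the high surrogate
    System.out.println(Integer.toHexString(cp));  // prints "20000"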

To find the length in code points, use String.codePointCount.
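
A sketch of the difference, again with the same two-char string:

    String s = "\uD840\uDC00";                         // one code point, two chars
    int units = s.length();                            // 2 (UTF-16 code units)
    int codePoints = s.codePointCount(0, s.length());  // 1 (Unicode code points)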

To find a substring, you need to find the offset in terms of UTF-16 code units, then use the normal substring method; use String.offsetByCodePoints to find the right index.
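
Something along these lines, as a rough sketch - substringByCodePoints is not a JDK method, just a name used here for illustration:

    // Substring with indices counted in code points rather than chars.
    static String substringByCodePoints(String s, int from, int to) {
        int begin = s.offsetByCodePoints(0, from);        // char index of code point 'from'
        int end = s.offsetByCodePoints(begin, to - from); // char index of code point 'to'
        return s.substring(begin, end);
    }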

Basically look through the String API at all the methods which contain codePoint.

Jon Skeet
  • Thanks, that answered the first part of my question. For the second part, I've used http://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java, since I didn't want characters at those code points complicating my string operations. – Wouter Jul 08 '13 at 14:59
  • Also, for other people that might -need- all code points, it might be interesting to have a look at: http://avro.apache.org/docs/1.6.1/api/java/org/apache/avro/util/Utf8.html – Wouter Jul 08 '13 at 15:01
  • So, it is this for substring? public static String substringUtf8(String utf8String, int from, int to) { return utf8String.substring(utf8String.offsetByCodePoints(0, from), utf8String.offsetByCodePoints(0, to));} – RobertG May 15 '14 at 12:02
  • @RobertG: It's not clear what you mean by "utf8String" in this case... a string is always UTF-16... if you mean substringByCodePoints then that might be correct. – Jon Skeet May 15 '14 at 12:12

What you should be looking for is Java's support for Unicode code points (in effect, UTF-32). Check out the codePoint methods on String, such as codePointAt.

Marko Topolnik