I'm trying to find a substring or characterAt method that works on a String containing UTF-8 encoded text in Java.
Internally, Java works with UTF-16. This means that a String is composed of chars, each 2 bytes in size. A Unicode character can take up to 4 bytes in UTF-8, and any code point above U+FFFF does not fit in a single char, so Java stores such a character as a surrogate pair of two chars.
For example: the character U+20000 (UTF-8 hex: F0 A0 80 80) is stored internally in Java as a String with two chars (UTF-16 hex: D840 and DC00).
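For illustration, the same expansion can be reproduced with the standard Character API:

char[] units = Character.toChars(0x20000); // expand the code point to UTF-16 units
System.out.println(units.length);          // 2
System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D840 DC00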
When you have a String containing one of these 4 byte characters and call length(), the answer is 2. When you call substring(0, 1), you get the first half of the character: a lone surrogate.
Some code to illustrate this:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

ByteBuffer inputBuffer = ByteBuffer.wrap(
        new byte[]{(byte) 0xF0, (byte) 0xA0, (byte) 0x80, (byte) 0x80});
CharBuffer data = Charset.forName("UTF-8").decode(inputBuffer);
String string_test = data.toString();
int length = string_test.length();                   // 2, not 1
String first_half = string_test.substring(0, 1);     // lone high surrogate "\uD840"
String second_half = string_test.substring(1, 2);    // lone low surrogate "\uDC00"
String full_character = string_test.substring(0, 2); // the complete character
All this, even if unexpected, is not a bug, since Java works in UTF-16. Inherent UTF-8 support would be nice, but it isn't there.
Does Java have any class in the standard library, or does a class exist somewhere else, that provides UTF-8 support? As in (see the sketch after this list):

- utf8string.length() - returns 1 if there is one 4 byte character in there
- utf8string.getCharacterAt(0) - returns the first character, not the first half of it
- utf8string.substring(0, 1) - returns the first character, not the first half of it
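As far as I can tell, the closest thing in the standard library is the code point API that String itself has carried since Java 5 (codePointCount, codePointAt, offsetByCodePoints). A minimal sketch of how the three calls above might map onto it; utf8string here is just a plain String, not a real class:

String utf8string = "\uD840\uDC00"; // U+20000 as a Java String
// length in code points instead of chars
int cpLength = utf8string.codePointCount(0, utf8string.length()); // 1
// the first code point as an int, instead of the first char
int firstCodePoint = utf8string.codePointAt(0);                   // 0x20000
// substring by code point index: translate code point offsets to char offsets
int start = utf8string.offsetByCodePoints(0, 0);
int end = utf8string.offsetByCodePoints(start, 1);
String firstCharacter = utf8string.substring(start, end);         // the whole character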
Or, what is the commonly used solution for this? Convert every character that doesn't fit in a single char to a default replacement character when reading UTF-8 files, and, as a result, lose all information on characters outside that range? That is not necessarily an issue in my specific implementation, so if there is a common way of doing this, I'd be interested.
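To make the fallback I'm describing concrete, a minimal sketch of that replacement approach, assuming U+FFFD as the default character (codePoints() needs Java 8):

// Replace every code point above U+FFFF with U+FFFD, dropping the extra information
String input = "a\uD840\uDC00b"; // 'a', U+20000, 'b'
StringBuilder sb = new StringBuilder();
input.codePoints()
     .map(cp -> cp > 0xFFFF ? 0xFFFD : cp)
     .forEach(sb::appendCodePoint);
String flattened = sb.toString(); // "a\uFFFDb" - length() is now 3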