I need to implement a method like this: `int toCodePoint(byte[] buf, int startIndex);` It should decode one UTF-8 character from the byte array, starting at `startIndex`, into a code point. No extra objects should be created (that is why I don't use the JDK `String` class for decoding). Are there any existing Java classes that do this? Thank you.
2 Answers
You can use `java.nio.charset.CharsetDecoder` to do that. You'll need a `ByteBuffer` and a `CharBuffer`. Put the data into the `ByteBuffer`, then use `CharsetDecoder.decode(ByteBuffer in, CharBuffer out, boolean endOfInput)` to read into the `CharBuffer`. Then you can get the code point using `Character.codePointAt(char[] a, int index)`. It is important to use this method because if your text has characters outside the BMP, they will be translated into two chars, so it's not sufficient to read only one char.
With this method you only need to create two buffers once, after that no new objects will be created unless some error occurs.
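The approach above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `Utf8BufferDecoder`, the buffer sizes, and the convention of returning `-1` on malformed input are assumptions made for illustration. The two buffers are allocated once and reused on every call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: holds the decoder and both buffers so that
// repeated calls do not allocate new buffers.
final class Utf8BufferDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // A UTF-8 sequence is at most 4 bytes; a code point is at most 2 chars.
    private final ByteBuffer in = ByteBuffer.allocate(4);
    private final CharBuffer out = CharBuffer.allocate(2);

    // Returns the code point starting at startIndex,
    // or -1 (an assumed convention) if nothing could be decoded.
    int toCodePoint(byte[] buf, int startIndex) {
        in.clear();
        out.clear();
        // Copy at most 4 bytes into the reusable input buffer.
        in.put(buf, startIndex, Math.min(4, buf.length - startIndex));
        in.flip();
        decoder.reset();
        // The result is ignored on purpose: bytes after the first
        // character may be an incomplete sequence, which is not an
        // error for our purposes as long as one character was decoded.
        decoder.decode(in, out, true);
        if (out.position() == 0) {
            return -1; // the first sequence itself was malformed or truncated
        }
        // Reads one or two chars, so supplementary characters work too.
        return Character.codePointAt(out.array(), 0, out.position());
    }
}
```

For example, `new Utf8BufferDecoder().toCodePoint(bytes, 0)` on the bytes `E2 82 AC` yields `0x20AC`. An instance like this is not thread-safe, because the buffers are shared between calls.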

- In Java `char` is a UTF-16 code unit, so for some code points this method will not give the correct code point value. – bames53 Feb 14 '12 at 15:46
- @bames53 As a matter of clarification, this method doesn't exactly give a code point, it translates UTF-8 bytes into chars. If it encounters a code point outside of the BMP, it will be translated into two chars. I have updated the answer a little to make it clear how you should read the result. – Malcolm Feb 14 '12 at 15:59
- Thank you. Actually, I wonder if the algorithm given below suits me? http://stackoverflow.com/questions/395832/how-to-get-code-point-number-for-a-given-character-in-a-utf-8-string – user1192878 Feb 14 '12 at 19:13
- @user1192878 Of course, you can use it. It is less reliable than the standard library, which has been tested to death, but it is definitely feasible. – Malcolm Feb 14 '12 at 19:20
None of the existing Java classes I know of fits this task, because of your restriction ("No extra objects should be created"). Otherwise you could use `CharsetDecoder` (as mentioned by Malcolm). Or you could even go to the dark side and use `sun.io.ByteToCharUTF8` if you really need a pure static method, but that is not a recommended way.
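If a pure static method with no allocation at all is the goal, decoding by hand is also possible. The following is a minimal sketch under stated assumptions, not a tested library routine: the class name `Utf8` and the choice to return `-1` for malformed, truncated, overlong, or surrogate sequences are illustrative and not part of any JDK API.

```java
// Hypothetical utility class; only primitives are used, so no objects
// are created during decoding.
final class Utf8 {
    private Utf8() {}

    // Returns the code point encoded at startIndex, or -1 (assumed
    // convention) if the bytes there are not a valid UTF-8 sequence.
    static int toCodePoint(byte[] buf, int startIndex) {
        int b0 = buf[startIndex] & 0xFF;
        if (b0 < 0x80) {
            return b0; // single-byte (ASCII) sequence
        }
        int needed; // number of continuation bytes expected
        int cp;     // payload bits collected so far
        int min;    // smallest value allowed for this length (rejects overlong forms)
        if ((b0 & 0xE0) == 0xC0) {
            needed = 1; cp = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            needed = 2; cp = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            needed = 3; cp = b0 & 0x07; min = 0x10000;
        } else {
            return -1; // stray continuation byte or invalid lead byte
        }
        if (startIndex + needed >= buf.length) {
            return -1; // sequence is cut off by the end of the array
        }
        for (int i = 1; i <= needed; i++) {
            int b = buf[startIndex + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return -1; // not a continuation byte (10xxxxxx)
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong encodings, values past U+10FFFF, and surrogates.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return -1;
        }
        return cp;
    }
}
```

A caller that needs to walk a whole array would also have to know how many bytes were consumed; that can be derived from the returned code point's range or from the lead byte, which this sketch leaves out to keep the signature from the question.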
