
I am trying to decode UTF-8 byte by byte with a CharsetDecoder. Is this possible?

The following code

public static void main(String[] args) {

    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;

    byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8

    byte[] b = new byte[1];
    ByteBuffer bb = ByteBuffer.wrap(b);

    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();

    b[0] = source[0];
    bb.rewind();

    cb.rewind();
    res = decoder.decode(bb, cb, false);

    System.out.println(res);
    System.out.println(cb.remaining());

    b[0] = source[1];
    bb.rewind();

    cb.rewind();
    res = decoder.decode(bb, cb, false);

    System.out.println(res);
    System.out.println(cb.remaining());
}

gives the following output.

UNDERFLOW
1
MALFORMED[1]
1

Why?

Suzan Cioc

3 Answers


My theory is that the problem with the way you are doing it is that, in the "underflow" condition, the decoder leaves the unconsumed bytes in the input buffer. At least, that is my reading.

Note this sentence in the javadoc:

"In any case, if this method is to be reinvoked in the same decoding operation then care should be taken to preserve any bytes remaining in the input buffer so that they are available to the next invocation. "

But you are clobbering the (presumably) unread byte.

You should be able to check whether my theory / interpretation is correct by looking at how many bytes remain unconsumed in bb after the first decode(...) call.
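
For example, adding this line right after the first decode(...) call in the question's code should settle it; if my reading is right, it will print 1 because the incomplete byte was left in the buffer:

    System.out.println(bb.remaining()); // 1 means the 0xc3 byte was not consumed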


If my theory is correct, then the answer is that you cannot decode UTF-8 by providing the decoder with byte buffers containing exactly one byte. But you could implement byte-by-byte decoding by starting with a ByteBuffer containing one byte and adding extra bytes until the decoder succeeds in outputting a character. Just make sure that you don't clobber input bytes that haven't been consumed yet.

Note that decoding like this is not efficient. The API design is optimized for decoding a large number of bytes in one go.
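
For illustration, here is a rough, untested reworking of the question's code along those lines: the backing array is made big enough for a whole UTF-8 sequence, and the second byte is written after the unconsumed first byte instead of over it.

public static void main(String[] args) {

    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();

    byte[] b = new byte[6];                 // room for a complete UTF-8 sequence
    ByteBuffer bb = ByteBuffer.wrap(b);

    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();

    b[0] = (byte) 0xc3;                     // feed the first byte only
    bb.limit(1);
    System.out.println(decoder.decode(bb, cb, false)); // UNDERFLOW, nothing consumed

    b[1] = (byte) 0xa6;                     // add the second byte; the first stays put
    bb.limit(2);
    System.out.println(decoder.decode(bb, cb, false)); // UNDERFLOW, 'æ' written to cb
    System.out.println(c[0]);               // prints æ
}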

Stephen C
  • Yes, I also noticed this now. But it is strange that this implementation relies on me to copy unconsumed bytes to a new buffer. It also means that the buffer can't be shorter than the longest character decoded. In particular, it means that it is IMPOSSIBLE to decode byte by byte. – Suzan Cioc Feb 10 '13 at 00:04
  • @SuzanCioc - not impossible. You just have to do it slightly differently. – Stephen C Feb 10 '13 at 00:06
  • But how? The decoder won't accept one byte and won't remember it. So I am obliged to feed it 2 bytes (in the current case), which means I need at least a 2-byte buffer. There is no way to feed it byte by byte! – Suzan Cioc Feb 10 '13 at 00:12
  • @SuzanCioc - Yes, you need a buffer with a capacity of up to 6 bytes. But you can still keep adding bytes one by one ... which should satisfy your higher-level requirement of byte-by-byte decoding. **Think outside the box!** – Stephen C Feb 10 '13 at 00:16

As has been said, UTF-8 uses 1-6 bytes per character. You need to add all of the bytes for the character to the ByteBuffer before you decode. Try this:

public static void main(String[] args) {

    Charset cs = Charset.forName("utf8");
    CharsetDecoder decoder = cs.newDecoder();
    CoderResult res;

    byte[] source = new byte[] {(byte)0xc3, (byte)0xa6}; // LATIN SMALL LETTER AE in UTF8

    byte[] b = new byte[2]; //two bytes for this char
    ByteBuffer bb = ByteBuffer.wrap(b);

    char[] c = new char[1];
    CharBuffer cb = CharBuffer.wrap(c);

    decoder.reset();

    b[0] = source[0];
    b[1] = source[1];
    bb.rewind();

    cb.rewind();
    res = decoder.decode(bb, cb, false); //translates 2 bytes to 1 char

    System.out.println(cb.remaining()); //prints 0
    System.out.println(cb.get(0)); //prints latin ae

}
Stephen
  • UTF-8 has anywhere from 1 to 6 bytes per character – Simon G. Feb 09 '13 at 23:34
  • How can I know in advance how many bytes I should allocate? Suppose I add one more byte, but it also turns out to be malformed. – Suzan Cioc Feb 09 '13 at 23:39
  • Allocate for six bytes. As long as the `CharsetDecoder` can read at least one full character at a time, it'll be happy; it'll just leave the extra bytes in the `ByteBuffer`, where you should `compact` them. – Louis Wasserman Feb 09 '13 at 23:42
  • @LouisWasserman @SimonG. You're both wrong. UTF-8 can contain max. 4 bytes per character. See [this SO question](http://stackoverflow.com/questions/9533258/what-is-the-maximum-number-of-bytes-for-a-utf-8-encoded-character) or [my blog post](http://stijndewitt.wordpress.com/2014/08/09/max-bytes-in-a-utf-8-char/) on this topic. – Stijn de Witt Aug 08 '14 at 22:52
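
A rough sketch of the buffer-reuse approach Louis Wasserman suggests in the comments above, using compact() to carry any unconsumed bytes over to the next round; the sample input and buffer sizes here are just illustrative choices.

public static void main(String[] args) throws CharacterCodingException {

    byte[] input = "æ Hello 你好".getBytes(StandardCharsets.UTF_8);

    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    ByteBuffer in = ByteBuffer.allocate(6);   // enough for the longest UTF-8 sequence
    CharBuffer out = CharBuffer.allocate(2);  // enough for a surrogate pair
    StringBuilder result = new StringBuilder();

    for (int i = 0; i < input.length; i++) {
        in.put(input[i]);                     // append one byte (buffer is in write mode)
        in.flip();                            // switch to read mode for the decoder
        CoderResult r = decoder.decode(in, out, i == input.length - 1);
        if (r.isError()) {
            r.throwException();               // surface malformed or unmappable input
        }
        in.compact();                         // keep unconsumed bytes, back to write mode

        out.flip();                           // drain whatever characters were produced
        result.append(out);
        out.clear();
    }

    decoder.flush(out);                       // finish the operation (produces nothing for UTF-8)
    out.flip();
    result.append(out);

    System.out.println(result);               // æ Hello 你好
}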

Here is my solution. The following decodes a UTF-8 byte sequence byte by byte.

public static void main(String[] args) throws CharacterCodingException {
    // The UTF-8 byte sequence that we'll decode
    ByteBuffer byteSequence = ByteBuffer.wrap(
            "Привет Hello 你好 こんにちは 안녕하세요,".getBytes(StandardCharsets.UTF_8)
    );


    StringBuilder decodeResult = new StringBuilder();

    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    ByteBuffer decodeBufIn = ByteBuffer.allocate(4);
    CharBuffer decodeBufOut = CharBuffer.allocate(2);

    // Due to the awful design of ByteBuffer, we need to maintain the write position ourselves
    int writePosition = 0;

    // Decode byte by byte
    while (byteSequence.remaining() > 0) {
        decodeBufIn.put(writePosition++, byteSequence.get());

        //Switch to read mode
        decodeBufIn.limit(writePosition);
        CoderResult r = decoder.decode(decodeBufIn, decodeBufOut, false);

        // Once the decoder produces output, consume it
        if (r.isUnderflow() || r.isOverflow()) {
            if (decodeBufOut.position() > 0) {
                decodeBufOut.flip();
                decodeResult.append(decodeBufOut);
                decodeBufOut.clear();

                decodeBufIn.clear();
                writePosition = 0;
            }
        } else {
            r.throwException();
        }

        //Switch to write mode
        decodeBufIn.limit(decodeBufIn.capacity());

        if (writePosition >= decodeBufIn.capacity()) {
            throw new IllegalStateException("This should never occur!");
        }
    }

    System.out.println(decodeResult);
}
light