
I have a problem with the CharsetDecoder class.

First example of code (which works):

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final ByteBuffer b = ByteBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
    for (int i=0; i<tab.length; i++){
        b.put(tab, i, 1);
    }
    try {
        b.flip();
        System.out.println("a" + dec.decode(b).toString() + "a");
    } catch (CharacterCodingException e1) {
        e1.printStackTrace();
    }

The result is a€a

But when I execute this code:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
    for (int i=0; i<tab.length; i++){
        ByteBuffer buffer = ByteBuffer.wrap(tab, i, 1);
        dec.decode(buffer, chars, i == 2);
    }
    dec.flush(chars);
    System.out.println("a" + chars.toString() + "a");

The result is a

Why is the result not the same?

How do I use the method decode(ByteBuffer, CharBuffer, endOfInput) of the class CharsetDecoder to get the result a€a?

-- EDIT --

Based on Jesper's code, I wrote the following. It's not perfect, but it works with step = 1, 2 and 3:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(6);
    final byte[] tab = new byte[]{(byte)97, (byte)-30, (byte)-126, (byte)-84, (byte)97, (byte)97}; //char €

    final ByteBuffer buffer = ByteBuffer.allocate(10);

    final int step = 3;
    for (int i = 0; i < tab.length; i++) {
        // Add the next byte to the buffer
        buffer.put(tab, i, step);
        i+=step-1;

        // Remember the current position
        final int pos = buffer.position();
        int l=chars.position();

        // Try to decode
        buffer.flip();
        final CoderResult result = dec.decode(buffer, chars, i >= tab.length -1);
        System.out.println(result);

        if (result.isUnderflow() && chars.position() == l) {
            // Underflow, prepare the buffer for more writing
            buffer.position(pos);
        }else{
            if (buffer.position() == buffer.limit()){
                //ByteBuffer decoded
                buffer.clear();
                buffer.position(0);
            }else{
                //a part of ByteBuffer is decoded. We keep only bytes which are not decoded
                final byte[] b = buffer.array();
                final int f = buffer.position();
                final int g = buffer.limit() - buffer.position();
                buffer.clear();
                buffer.position(0);
                buffer.put(b, f, g);
            }
        }
        buffer.limit(buffer.capacity());
    }

    dec.flush(chars);
    chars.flip();

    System.out.println(chars.toString());
lecogiteur
  • The result is a? Not aa? That's very odd. – user253751 Apr 10 '15 at 11:06
  • My output for the second variant is `a   a` (three spaces). – Seelenvirtuose Apr 10 '15 at 11:15
  • In my case the result is only "a" with a carriage return. – lecogiteur Apr 10 '15 at 11:59
  • Just to complement the answers: if you loop over several byte arrays and try to decode them independently, you must deal with the problem that some bytes together form a character but are split across two of those byte arrays. Decoding with the mentioned method will decode as much as possible, and then you will get a `CoderResult.UNDERFLOW`. That result simply means that one or a few bytes were not decoded and must be placed in front of the next byte array in the loop. That's it. – Seelenvirtuose Apr 10 '15 at 13:01
  • So I suggest closing this question, because it seems that the OP doesn't understand elementary things about how `byte-to-char` conversion works. – Andremoniy Apr 10 '15 at 13:02
  • @Andremoniy I know elementary byte-to-char conversion. I know that some chars consist of multiple bytes, and I'm aware that the bytes which form a character can be split across two byte arrays in my program. But I had supposed that the method `decode` could manage this problem. It seems that it does not. – lecogiteur Apr 10 '15 at 15:19
  • @Seelenvirtuose Thanks for your response. It seems `CoderResult.UNDERFLOW` is part of the answer. In fact, if you have a byte like 97 (a), the method decode returns `CoderResult.UNDERFLOW` and adds _a_ to the `CharBuffer` if the boolean endOfInput is false... – lecogiteur Apr 10 '15 at 15:22

2 Answers


The method decode(ByteBuffer, CharBuffer, boolean) returns a result, but you are ignoring it. If you print the result in your second code fragment:

    for (int i = 0; i < tab.length; i++) {
        ByteBuffer buffer = ByteBuffer.wrap(tab, i, 1);
        System.out.println(dec.decode(buffer, chars, i == 2));
    }

you'll see this output:

    UNDERFLOW
    MALFORMED[1]
    MALFORMED[1]
    a   a

Apparently it does not work correctly if you start decoding in the middle of a character. The decoder expects that the first thing it reads is the start of a valid UTF-8 sequence.

edit - When the decoder reports UNDERFLOW, it expects you to add more data to the input buffer and then call decode() again, but you must re-offer it the data from the start of the UTF-8 sequence that you are trying to decode. You can't continue in the middle of a UTF-8 sequence.

Here is a version that works, adding one byte from tab in every iteration of the loop:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte) -30, (byte) -126, (byte) -84}; //char €

    final ByteBuffer buffer = ByteBuffer.allocate(10);

    for (int i = 0; i < tab.length; i++) {
        // Add the next byte to the buffer
        buffer.put(tab[i]);

        // Remember the current position
        final int pos = buffer.position();

        // Try to decode
        buffer.flip();
        final CoderResult result = dec.decode(buffer, chars, i == 2);
        System.out.println(result);

        if (result.isUnderflow()) {
            // Underflow, prepare the buffer for more writing
            buffer.limit(buffer.capacity());
            buffer.position(pos);
        }
    }

    dec.flush(chars);
    chars.flip();

    System.out.println("a" + chars.toString() + "a");
Jesper
  • You haven't provided an answer to the question: **how to use this method** to **achieve the result**. – Andremoniy Apr 10 '15 at 11:34
  • @Andremoniy So, you're voting it down out of revenge, because I remarked that your answer is wrong? – Jesper Apr 10 '15 at 11:36
  • I've downvoted it because it is not an answer. If I were the OP, I would not understand how to fix my code from your answer. – Andremoniy Apr 10 '15 at 11:37
  • I have the same result. But I don't understand your answer. What do you mean by "Apparently it does not work correctly if you start decoding in the middle of a character."? – lecogiteur Apr 10 '15 at 12:34
  • @lecogiteur The Euro character takes up 3 bytes in UTF-8. You are giving the decoder the bytes one by one in your second example, instead of giving it all 3 bytes at once. After the first byte it says `UNDERFLOW`, which means it needs more bytes to decode a character. But at the second byte it says `MALFORMED` - because that second byte is not the beginning of a valid UTF-8 byte sequence. – Jesper Apr 10 '15 at 12:53
  • @Jesper So, if I understand you, the decoder doesn't store the bytes it can't decode, and the `flush` method is useless. For you, the solution is that I must allocate a new ByteBuffer, add the bytes to it one by one, and decode the complete ByteBuffer at once, like in my first code example. In that case, what is this `decode` method for? – lecogiteur Apr 10 '15 at 13:01
  • @Jesper OK, thanks. I did some tests. Your code doesn't work with this tab: `{(byte)97, (byte)-30, (byte)-126, (byte)-84, (byte)97}`. The result is `aaaa€a` instead of `a€a`. I think the condition `result.isUnderflow()` is not enough. It seems that when the boolean `endOfInput` is false, the method `decode` always returns underflow. In that case, the method `flush` is not necessary. I'm surprised that the class `CharsetDecoder` doesn't handle this natively. But with your code I can read the start of the string before I receive all the bytes, which is what I want. – lecogiteur Apr 10 '15 at 15:05
  • @lecogiteur I tried it with that tab and indeed get the same result as you. I think the conclusion is: you must make sure that the buffer does not contain an incomplete UTF-8 sequence when you call `decode`. – Jesper Apr 10 '15 at 18:27
  • @Jesper No, you don't need to make decisions about the input using as much procedural knowledge as the decoder itself has; just don't throw away the buffer position on return. – SensorSmith Apr 08 '20 at 01:03

The decoder does not internally cache the data from partial characters, but this does not mean that you have to do complicated things to figure out what data to re-feed the decoder. You gave it a clear way to represent what data it actually consumed, i.e. the input ByteBuffer and its position. In the second example, by giving it a new ByteBuffer every time, the OP failed to pass the decoder back what it reported it had not yet consumed.
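
To illustrate that point, here is a minimal sketch, assuming the JDK's built-in UTF-8 decoder. The class name `PositionDemo` and the helper `unconsumedAfterPartialDecode` are made up for this example. It offers only the first of the three euro-sign bytes and then inspects the input buffer: the byte the decoder could not use is still there, between position and limit, ready to be carried over to the next call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class PositionDemo {

    /** Offers an incomplete UTF-8 sequence and returns the number of bytes left unconsumed. */
    static int unconsumedAfterPartialDecode() {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        CharBuffer chars = CharBuffer.allocate(3);
        // Only the first of the three bytes that encode the euro sign
        ByteBuffer in = ByteBuffer.wrap(new byte[]{(byte) -30});

        // endOfInput is false: more bytes may follow
        CoderResult result = dec.decode(in, chars, false);

        // Report what the decoder said and where it left the input buffer
        System.out.println(result + ", position=" + in.position()
                + ", remaining=" + in.remaining());
        return in.remaining();
    }

    public static void main(String[] args) {
        unconsumedAfterPartialDecode();
    }
}
```

Wrapping a fresh one-byte buffer on every iteration, as the second example in the question does, discards exactly this leftover byte.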

The standard pattern for using NIO Buffers is input, flip, output, compact, loop. Short of optimization (which may be premature), there is no reason to re-implement compact yourself. You might just get it wrong, like @Jesper and @lecogiteur did (if more than a single character was ever presented). You should NOT be resetting to the position from before the decode call.

The second example should have read something like:

    final CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
    final CharBuffer chars = CharBuffer.allocate(3);
    final byte[] tab = new byte[]{(byte)-30, (byte)-126, (byte)-84}; //char €
    final ByteBuffer buffer = ByteBuffer.wrap(new byte[3]);

    for (int i=0; i<tab.length; i++){
        buffer.put(tab, i, 1);  // In actual usage some type of IO read/transfer would occur here
        buffer.flip();
        dec.decode(buffer, chars, i == 2);
        buffer.compact();       // Keep any bytes the decoder did not consume
    }
    dec.flush(chars);
    chars.flip();               // Make the decoded chars readable before printing
    System.out.println("a" + chars.toString() + "a");

NOTE: The above does not check the return value to detect malformed input or other error handling for running safely on arbitrary input/IO conditions.
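
For completeness, here is a sketch of the same input, flip, output, compact loop with the return values checked. It is one possible arrangement, not the only one. The names `ChunkedDecode`, `decodeInChunks` and `drain`, the buffer sizes, and the choice to rethrow coding errors as `UncheckedIOException` are all assumptions made for this example. The `step` parameter mimics the chunk size from the question's edit.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class ChunkedDecode {

    /** Decodes data as UTF-8, offering it to the decoder at most step bytes at a time. */
    static String decodeInChunks(byte[] data, int step) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        // Room for one chunk plus the few carried-over bytes of a split character
        ByteBuffer in = ByteBuffer.allocate(step + 8);
        CharBuffer out = CharBuffer.allocate(16);
        StringBuilder sb = new StringBuilder();
        try {
            int off = 0;
            do {
                int len = Math.min(step, data.length - off);
                off += len;
                boolean last = off >= data.length;

                in.put(data, off - len, len);   // input (stands in for an IO read)
                in.flip();
                CoderResult r;
                do {
                    r = dec.decode(in, out, last);
                    if (r.isError()) {
                        r.throwException();     // malformed or unmappable input
                    }
                    drain(out, sb);             // also makes room after an OVERFLOW
                } while (r.isOverflow());
                in.compact();                   // keep bytes the decoder did not consume
            } while (off < data.length);

            CoderResult r;
            do {
                r = dec.flush(out);
                drain(out, sb);
            } while (r.isOverflow());
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Moves the decoded chars into sb and empties the CharBuffer for reuse. */
    private static void drain(CharBuffer out, StringBuilder sb) {
        out.flip();
        sb.append(out);
        out.clear();
    }

    public static void main(String[] args) {
        byte[] tab = {97, (byte) -30, (byte) -126, (byte) -84, 97, 97};
        for (int step = 1; step <= 3; step++) {
            System.out.println(decodeInChunks(tab, step));
        }
    }
}
```

Because `compact()` carries the unconsumed bytes forward, the loop needs no knowledge of where UTF-8 sequences begin or end. The chunk size affects only how often `decode` is called.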

SensorSmith