Problems Converting Between ByteBuffer and String in Java

Question

I'm currently developing an application where users can edit a ByteBuffer via a hex editor interface and also edit the corresponding text through a JTextPane. My issue is that, because the JTextPane requires a String, I need to convert the ByteBuffer to a String before displaying the value. However, during the conversion, invalid characters are replaced by the charset's default replacement character. This squashes the invalid value, so when I convert the String back to a ByteBuffer, the invalid character's value is replaced by the byte value of the default replacement character. Is there an easy way to retain the byte value of an invalid character in a String? I've read the following Stack Overflow posts, but usually folks just want to replace unprintable characters; I need to preserve them.

Java ByteBuffer to String

Java: Converting String to and from ByteBuffer and associated problems

Is there an easy way to do this or do I need to keep track of all the changes that happen in the text editor and apply them to the ByteBuffer?

Here is code demonstrating the problem. The code uses byte[] instead of ByteBuffer but the issue is the same.

        byte[] temp = new byte[16];
        // 0x99 on its own isn't a valid UTF-8 sequence (it is a continuation byte)
        Arrays.fill(temp,(byte)0x99);

        System.out.println(Arrays.toString(temp));
        // Prints [-103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103]
        // -103 == 0x99

        System.out.println(new String(temp));
        // Prints ����������������
        // � (U+FFFD) is the default replacement character

        // This takes the byte[], converts it to a string, converts it back to a byte[]
        System.out.println(Arrays.toString(new String(temp).getBytes()));
        // I need this to print [-103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103, -103]
        // However, it prints
        //[-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67]
        // Each triple -17, -65, -67 (0xEF 0xBF 0xBD) is the UTF-8 encoding of �

I think this needs code. Sounds like a bug. Could also be a conceptual error: what exact text sequence(s) are you having trouble converting to bytes? – markspace, Oct 02 '16 at 20:10
I've updated the question to include code showing the issue. This isn't a bug in my code; it's supposed to work this way by default. – Justin Moore, Oct 02 '16 at 20:28

score 1 · Accepted Answer · answered Oct 02 '16 at 21:03

UTF-8 in particular will go wrong, because many byte sequences are not valid UTF-8:

    byte[] bytes = {'a', (byte) 0xfd, 'b', (byte) 0xe5, 'c'};
    String s = new String(bytes, StandardCharsets.UTF_8);
    System.out.println("s: " + s);

One needs a CharsetDecoder. With it one can ignore (= delete) or replace the offending bytes or, by default, have them reported as an error (an exception or an error CoderResult, depending on which decode method is used).
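The three choices mentioned above correspond to the `CodingErrorAction` constants `IGNORE`, `REPLACE` and `REPORT`. The following is a minimal sketch, not part of the original answer; the class name `DecoderActionsDemo` and the helper `decode` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only the java.nio calls are real API.
class DecoderActionsDemo {
    // Decodes bytes as UTF-8, applying the given action to malformed input.
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {'a', (byte) 0xfd, 'b', (byte) 0xe5, 'c'};
        try {
            // IGNORE silently drops the bad bytes.
            System.out.println(decode(bytes, CodingErrorAction.IGNORE));
            // REPLACE substitutes U+FFFD for each bad byte.
            System.out.println(decode(bytes, CodingErrorAction.REPLACE));
            // REPORT (the default for a new decoder) makes this method throw.
            System.out.println(decode(bytes, CodingErrorAction.REPORT));
        } catch (CharacterCodingException ex) {
            System.out.println("REPORT: " + ex);
        }
    }
}
```

None of the three actions keeps the original byte value in the resulting String, which is why the code further down handles the error result by hand.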

For the JTextPane we use HTML, so we can write the hex code of the offending byte in a <span> giving it a red background.

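    // Needs java.nio.ByteBuffer, java.nio.CharBuffer and
    // java.nio.charset.{CharsetDecoder, CoderResult, StandardCharsets}.
    ByteBuffer byteBuffer = ByteBuffer.wrap(bytes);
    // A new decoder reports malformed input instead of replacing it.
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    // Each bad byte expands to a 64-char <span>, so size for the worst case.
    CharBuffer charBuffer = CharBuffer.allocate(bytes.length * 64 + 16);
    charBuffer.append("<html>");
    for (;;) {
        // endOfInput = true, so a truncated trailing sequence is reported too.
        CoderResult result = decoder.decode(byteBuffer, charBuffer, true);
        if (!result.isError()) {
            break; // everything has been decoded
        }
        // The position is at the start of the offending sequence:
        // consume one byte and show it as hex.
        int b = 0xFF & byteBuffer.get();
        charBuffer.append(String.format(
            "<span style='background-color:red; font-weight:bold'> %02X </span>",
            b));
        decoder.reset();
    }
    // flip(), not rewind(): rewind() leaves the limit at capacity,
    // so toString() would include the unused NUL chars.
    charBuffer.flip();
    String t = charBuffer.toString();
    System.out.println("t: " + t);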

The decoder API is not especially pleasant to use, but the code is a starting point to experiment with.

That's a really good idea that I hadn't even considered. The only problem I see is that there's going to be a ton of additional markup in the text of the JTextPane when I convert it back from a String to a byte[]. Do you have any ideas on how to get around that? – Justin Moore, Oct 02 '16 at 21:07
Use `replaceAll("<[^>]*>", "")` or, better, a loop with a Pattern and Matcher. – Joop Eggen, Oct 02 '16 at 21:09
A JTextPane would also allow you to use styled text (StyledDocument) and keep the attributes separate from the text, but that is cumbersome, especially if you want to allow editing. But you may use `byteBuffer.position()` to mark those bytes. – Joop Eggen, Oct 02 '16 at 21:11
I think this approach might be the best to satisfy my needs for this specific project. I was hoping there was something easier I could do, but this will probably have to do. Thanks! – Justin Moore, Oct 02 '16 at 21:13
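The "loop with a Pattern and Matcher" suggested in the comments could look roughly like the sketch below. It assumes the input has exactly the shape produced by the answer's code (a leading `<html>`, plain text, and `<span ...> XX </span>` markers); the text returned by a real JTextPane in HTML mode may contain additional markup and entity escapes that this does not handle. The names `MarkupToBytes` and `toBytes` are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, assuming the markup format generated in the answer.
class MarkupToBytes {
    // Matches one marker span and captures its two hex digits.
    private static final Pattern BAD_BYTE =
            Pattern.compile("<span[^>]*>\\s*([0-9A-Fa-f]{2})\\s*</span>");

    static byte[] toBytes(String markup) {
        String body = markup.startsWith("<html>")
                ? markup.substring("<html>".length())
                : markup;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Matcher m = BAD_BYTE.matcher(body);
        int last = 0;
        while (m.find()) {
            // Text before the span is ordinary text: encode it as UTF-8.
            byte[] text = body.substring(last, m.start())
                    .getBytes(StandardCharsets.UTF_8);
            out.write(text, 0, text.length);
            // The span stands for one raw byte written as hex.
            out.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        byte[] tail = body.substring(last).getBytes(StandardCharsets.UTF_8);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String span =
            "<span style='background-color:red; font-weight:bold'> %02X </span>";
        String t = "<html>a" + String.format(span, 0xFD)
                 + "b" + String.format(span, 0xE5) + "c";
        System.out.println(Arrays.toString(toBytes(t)));
        // Prints [97, -3, 98, -27, 99]
    }
}
```

Unlike a blanket `replaceAll`, this keeps the marked bytes instead of discarding them along with the tags.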

bmargulies · Answer 2 · 2016-10-02T20:47:24.550

0

What do you think that new String(temp).getBytes() will do for you?

I can tell you that it does something BAD.

It converts temp to a String using the default encoding, which is probably wrong, and may lose information.
It converts the results back to a byte array, using the default encoding.

To turn a byte[] into a String, you must always pass a Charset into the String constructor, or else use a decoder directly. Since you are working from buffers, you might find the decoder API congenial.

To turn a String into a byte[], you must always call getBytes(Charset) so that you know that you're using the correct charset.
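As a small illustration of the two rules above, the sketch below passes `StandardCharsets.UTF_8` explicitly in both directions. Note that the `String(byte[], Charset)` constructor does not throw on malformed input; it substitutes the replacement character, so an explicit charset makes the result predictable but does not by itself preserve invalid bytes. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class.
class ExplicitCharsetDemo {
    // bytes -> String -> bytes, with the charset named explicitly both ways.
    static byte[] roundTripUtf8(byte[] input) {
        String s = new String(input, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid UTF-8 survives the round trip unchanged.
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(valid, roundTripUtf8(valid)));
        // Prints true

        // A lone 0x99 is malformed, so it comes back as U+FFFD (EF BF BD).
        byte[] invalid = {(byte) 0x99};
        System.out.println(Arrays.toString(roundTripUtf8(invalid)));
        // Prints [-17, -65, -67]
    }
}
```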

Based on the comments, I now suspect that what you actually need is code like the following to convert bytes to hex for your UI (and something corresponding to convert back).

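    String getHexString(byte[] bytes) {
        StringBuilder builder = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask after shifting so that negative bytes do not sign-extend.
            int high = (b >> 4) & 0x0f;
            int low = b & 0x0f;
            // Character.forDigit maps 0..15 to '0'..'9', 'a'..'f'.
            builder.append(Character.forDigit(high, 16));
            builder.append(Character.forDigit(low, 16));
        }
        return builder.toString();
    }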
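The "something corresponding to get back" could be sketched as follows. This is an assumption about what the answer had in mind, not code from it; `HexToBytes` and `fromHexString` are hypothetical names, and the input is expected to be an even-length string of hex digits with no separators.

```java
import java.util.Arrays;

// Hypothetical counterpart to getHexString.
class HexToBytes {
    static byte[] fromHexString(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            // Character.digit returns -1 for anything that is not a hex digit.
            int high = Character.digit(hex.charAt(2 * i), 16);
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException(
                        "not a hex digit near index " + (2 * i));
            }
            bytes[i] = (byte) ((high << 4) | low);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fromHexString("99ff00")));
        // Prints [-103, -1, 0]
    }
}
```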

edited Oct 02 '16 at 20:47

answered Oct 02 '16 at 20:34

bmargulies


I understand that best practice dictates that both getBytes and the String constructor should take a Charset. The issue still exists if I pass a Charset into the String constructor. `new String (temp, "UTF-8")` throws an `UnsupportedEncodingException` because the `byte[]` contains unmappable characters by design. I feel that the answer is going to need to use the CharsetDecoder API, but I haven't seen any examples using it for something similar. – Justin Moore Oct 02 '16 at 20:43
If it contains non-UTF-8, you may not convert it to a string if you want to keep all the information. You need to convert each `byte` to two hex digits; there's no way to do that with the APIs you are using. – bmargulies Oct 02 '16 at 20:44
@JustinA.Moore So, now that we've found the conceptual error/bug, what *exactly do you want to do with unmappable characters?* They are, by definition, unmappable, so you have to have some plan for them that's outside of `Charset`'s purview. – markspace Oct 02 '16 at 20:46
They can be printed inside the JTextArea as anything (an empty space, the � character from above, whatever really; they don't have a character associated with them). I just need the underlying byte to stay the same when the String is converted back to a byte[] or ByteBuffer. – Justin Moore Oct 02 '16 at 20:54
You can't have that unless you write a custom Charset. There are no charsets that provide round-trip of all possible byte values. _something_ will be mapped to a substitution character, and thus get lost, always. – bmargulies Oct 02 '16 at 21:00
Yeah, that makes sense. I think I'm going to need to go with the above answer and indicate which characters inside the text are actually raw bytes by marking them with HTML. I was hoping something along the lines of what you both were suggesting existed, but I don't think that's the case. Thanks for the input, I really appreciate it. – Justin Moore Oct 02 '16 at 21:15
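One alternative worth testing against the claim in the comments: a single-byte charset such as ISO-8859-1 assigns a char to every byte value 0x00 to 0xFF, so decoding and re-encoding with it should return the original bytes. The sketch below checks this for all 256 values. It is not a drop-in fix for the question, since the text is then displayed as Latin-1 rather than UTF-8, and any character above U+00FF typed into the editor cannot be encoded back. The class name `Latin1RoundTrip` is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class.
class Latin1RoundTrip {
    // bytes -> String -> bytes using ISO-8859-1 in both directions.
    static byte[] roundTrip(byte[] input) {
        String s = new String(input, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Build an array holding every possible byte value once.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(Arrays.equals(all, roundTrip(all)));
        // Prints true
    }
}
```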
