Issue with Java's String.getBytes method

Question

I have a byte array of size 8, which I convert to a String using the code below.

When I convert the string back to a byte[] using the getBytes method, the result is a 16-byte array in which only a few (2 or 3) bytes match the original array. Can someone tell me where I am going wrong?

byte[] message = new byte[8];
//initialize message
printBytes("message: " + message.length + " = ", message);
try {
    String test = new String(message, "utf-8");
    System.out.println(test);
    byte[] f = test.getBytes("utf-8");
    Help.printBytes("test = " + f.length, f);
} catch (UnsupportedEncodingException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}

printBytes function:

public static void printBytes(String msg, byte[] b){
    System.out.print(msg + " = ");
    for(int i = 0; i < b.length; i++){
        System.out.print("" + String.format("%02X", b[i]));
    }
    System.out.println("\n");
}

Output:

message: 8 =  = 9A52D5D6C6E999AD

�R���陭
test = 16 = EFBFBD52EFBFBDEFBFBDEFBFBDE999AD

Because the string encoding Java uses is not 8-bit; it's 16-bit. Maybe Unicode or UTF, not sure which. — Jos, Sep 23 '15 at 13:35
Also, I don't think converting byte arrays to strings is a good idea if you want them to be reproduced as byte arrays. You can try converting to a hex string instead. — Jos, Sep 23 '15 at 13:36
But when doing the reverse, it should also be using the same encoding. I should get the expected result anyway. — vish4071, Sep 23 '15 at 13:36
You never show how you build the original array, but it sure looks like it doesn't contain valid UTF-8 bytes. — Kayaman, Sep 23 '15 at 13:37
Actually, my use case is this: I want to convert byte[] to string and then DES encrypt it. I'm using this code to do it: http://stackoverflow.com/questions/20227/how-do-i-use-3des-encryption-decryption-in-java — vish4071, Sep 23 '15 at 13:38
It contains bytes all in the range 0-255, @Kayaman. I'm sure there is no problem there. — vish4071, Sep 23 '15 at 13:40
@vish4071 This is not how it works, you FIRST convert text/String to bytes THEN you encode it. When decoding you get bytes which you convert back to text/String. — A4L, Sep 23 '15 at 13:40
Yes, @Kayaman. I have not exactly studied `utf-8`. See the link I mentioned above. That code uses the utf-8 format for encoding/decoding, so I thought I'd use the same. — vish4071, Sep 23 '15 at 13:47
Can anyone tell me how I can DES encrypt the byte[] (i.e. message)? My use case is described in the comment above. — vish4071, Sep 23 '15 at 13:51
Just: byte[] encrypted = encrypt("Hallo World".getBytes("UTF-8")); I can't see any attempt to encrypt in your code, or any payload/content in the variable "message". — xoned, Sep 23 '15 at 14:18
@Timo, I don't have a string to be encrypted. I only have a byte array, `message`, and I want to encrypt that. — vish4071, Sep 23 '15 at 14:20
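
As a rough illustration of the point made in the comments above (a cipher works on bytes, so a byte[] never has to pass through a String first), here is a minimal sketch. The class name `EncryptBytesSketch`, the helpers `newKey` and `crypt`, the freshly generated key and IV, and the choice of `DESede/CBC/PKCS5Padding` are all assumptions made for the example; none of them come from the question or the linked post.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class EncryptBytesSketch {

    // Hypothetical helper: generates a throwaway 3DES key for the demo.
    public static SecretKey newKey() {
        try {
            return KeyGenerator.getInstance("DESede").generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: runs the cipher in the given mode over raw bytes.
    // No String conversion happens anywhere on this path.
    public static byte[] crypt(int mode, SecretKey key, byte[] iv, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("DESede/CBC/PKCS5Padding");
            cipher.init(mode, key, new IvParameterSpec(iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same eight bytes as in the question's output.
        byte[] message = {(byte) 0x9A, 0x52, (byte) 0xD5, (byte) 0xD6,
                          (byte) 0xC6, (byte) 0xE9, (byte) 0x99, (byte) 0xAD};

        SecretKey key = newKey();
        byte[] iv = new byte[8];            // the DESede block size is 8 bytes
        new SecureRandom().nextBytes(iv);

        byte[] encrypted = crypt(Cipher.ENCRYPT_MODE, key, iv, message);
        byte[] decrypted = crypt(Cipher.DECRYPT_MODE, key, iv, encrypted);

        System.out.println("round trip intact: " + Arrays.equals(message, decrypted));
    }
}
```

If the ciphertext has to be stored or sent as text, a hex or Base64 representation of `encrypted` would be the safer choice than `new String(encrypted, ...)`.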

yshavit · Accepted Answer · 2015-09-23T13:44:06.620

6

Your original byte[] contained illegal byte sequences (that is, sequences that don't form valid UTF-8 characters). The behavior of the String(byte[], String) constructor is unspecified for such input, but in your implementation each bad byte is replaced by the "�" character, which is \uFFFD, a three-byte character in UTF-8. You have four of these, which accounts for 12 bytes right there. The remaining 4 bytes are the ones that did decode cleanly: 52 ("R") and the three-byte sequence E999AD, which is why only those survive the round trip.
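
To see the replacement happen, here is a small sketch using the eight bytes from the question's output. The class name `ReplacementSketch` and the helpers `roundTrip` and `isValidUtf8` are made up for the example. It uses the `String(byte[], Charset)` overload, which is documented to replace malformed input, and a `CharsetDecoder` set to `REPORT` to check validity up front.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplacementSketch {

    // Decodes as UTF-8 and encodes again. Malformed input is replaced
    // with U+FFFD on the way in, so the result can differ from the input.
    public static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // Strict check: a decoder set to REPORT throws instead of replacing.
    public static boolean isValidUtf8(byte[] raw) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] message = {(byte) 0x9A, 0x52, (byte) 0xD5, (byte) 0xD6,
                          (byte) 0xC6, (byte) 0xE9, (byte) 0x99, (byte) 0xAD};

        System.out.println(isValidUtf8(message));       // false
        System.out.println(roundTrip(message).length);  // 16 = 4 * 3 + 1 + 3
    }
}
```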

edited Sep 23 '15 at 13:44

answered Sep 23 '15 at 13:39


I initialize the `byte[]` by selecting a random value between 0 and 255 and assigning it to `byte[i]`, where i ranges over [0,8) – vish4071 Sep 23 '15 at 13:41
@vish4071 You can't assign the values randomly. UTF8 has rules you need to follow. – Kayaman Sep 23 '15 at 13:43
@vish4071 Well... don't do that, if you want to encode your `byte[]` as a utf-8 string. Not all byte sequences are valid in all encodings. If you're trying to "Stringify" the bytes for serialization or similar, I would suggest base64 or something like that. – yshavit Sep 23 '15 at 13:43
See one of my comments on the post, which describes my use case. Can anyone tell me how I can DES encrypt the byte[] (i.e. message)? – vish4071 Sep 23 '15 at 13:44
That would be a different question. :) It looks like A4L provided you a good direction, but SO is for specific, targeted questions -- so we shouldn't let this question morph into something bigger. – yshavit Sep 23 '15 at 13:46
I understand that @yshavit, but at the same time, I framed a similar question before but I was told that it was too broad and could not be answered like that – vish4071 Sep 23 '15 at 13:50
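
For the Base64 suggestion in the comments above, a minimal sketch with `java.util.Base64` (available since Java 8) could look like this. The class name `Base64Sketch` and its `encode`/`decode` wrappers are made up for the example; the point is only that arbitrary bytes survive the trip through a String unchanged.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Sketch {

    // Any byte sequence maps to a plain ASCII string.
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Decoding restores exactly the original bytes.
    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // The same eight bytes as in the question's output.
        byte[] message = {(byte) 0x9A, 0x52, (byte) 0xD5, (byte) 0xD6,
                          (byte) 0xC6, (byte) 0xE9, (byte) 0x99, (byte) 0xAD};

        String text = encode(message);
        byte[] restored = decode(text);

        System.out.println(text);
        System.out.println("round trip intact: " + Arrays.equals(message, restored));
    }
}
```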

TwilightTitus · Answer 2 · 2015-09-23T14:35:16.210

new String(message, "utf-8");

This code tells the String object that your message is UTF-8 encoded.

test.getBytes("utf-8");

This code means: give me the bytes of the string, encoded as UTF-8. The result is that your string will be double UTF-8 encoded.

Do the encoding only once.

String test = new String(message, "utf-8");
test.getBytes();

Sample for double encoded strings:

public class Test {

    public static void main(String[] args) {
        try {
            String message = "äöü";
            Test.printBytes("java internal encoded: = ", message.getBytes());
            Test.printBytes("utf-8 encoded: = ", message.getBytes("utf-8"));
            // get the string utf-8 encoded and create a new string with the
            // utf-8 encoded content
            message = new String(message.getBytes("utf-8"), "utf-8");
            Test.printBytes("test get bytes without charset: = ", message.getBytes());
            Test.printBytes("test get bytes with charset: = ", message.getBytes("utf-8"));
            System.out.println(message);
            System.out.println("double encoded: " + new String(message.getBytes("utf-8")));
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    public static void printBytes(String msg, byte[] b) {
        System.out.print(msg + " = ");
        for (int i = 0; i < b.length; i++) {
            System.out.print("" + String.format("%02X", b[i]));
        }
        System.out.println("\n");
    }

}

Output:

java internal encoded: =  = E4F6FC
utf-8 encoded: =  = C3A4C3B6C3BC
test get bytes without charset: =  = E4F6FC
test get bytes with charset: =  = C3A4C3B6C3BC

äöü
double encoded: Ã¤Ã¶Ã¼ <-- the java internal encoding is not converted to utf-8, it is double encoded

Wrong. If you don't pass the encoding as a parameter, the platform default encoding will be used. Since this can cause problems when suddenly you have different platforms with different encodings, you should **always** explicitly tell the encoding used. — Kayaman, Sep 23 '15 at 13:49
*"your string will be double utf-8 encoded"* Sorry, but this sounds funny :P. — Tom, Sep 23 '15 at 13:50
@TwilightTitus Your "Java internal encoded" is actually "characters encoded with the default platform encoding" (which on my platforms is UTF-8), and your "double encoded" is actually "this is what UTF-8 data looks like when you pretend it's ISO-8859-1". It's not double encoding, it's **wrong** encoding. — Kayaman, Sep 24 '15 at 10:42
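
The point in the last comment can be reproduced without relying on the platform default at all. In this sketch (the class name `WrongCharsetSketch` and its helper names are made up for the example), every charset is named explicitly: a UTF-8 round trip returns the original string, while decoding the same UTF-8 bytes as ISO-8859-1 produces the "Ã¤Ã¶Ã¼" text from the output above. Unicode escapes are used in the source so that the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class WrongCharsetSketch {

    // Encode and decode with the same charset: nothing is lost.
    public static String roundTripUtf8(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8 but decode as ISO-8859-1: every byte becomes
    // its own character, so each two-byte sequence turns into two characters.
    public static String decodeUtf8AsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String message = "\u00E4\u00F6\u00FC";                    // "äöü"
        String garbled = "\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC";  // "Ã¤Ã¶Ã¼"

        System.out.println(roundTripUtf8(message).equals(message));       // true
        System.out.println(decodeUtf8AsLatin1(message).equals(garbled));  // true
    }
}
```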
