
I have a byte array of size 8, which I convert to a String using the code below.

When I convert the string back to a byte[] using the getBytes method, the result is absurd: a 16-byte array in which only a few (2 or 3) bytes match the original array. Can someone tell me where I am going wrong?

byte[] message = new byte[8];
//initialize message
printBytes("message: " + message.length + " = ", message);
try {
    String test = new String(message, "utf-8");
    System.out.println(test);
    byte[] f = test.getBytes("utf-8");
    Help.printBytes("test = " + f.length, f);
} catch (UnsupportedEncodingException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}

printBytes function:

public static void printBytes(String msg, byte[] b){
    System.out.print(msg + " = ");
    for(int i = 0; i < b.length; i++){
        System.out.print("" + String.format("%02X", b[i]));
    }
    System.out.println("\n");
}

Output:

message: 8 =  = 9A52D5D6C6E999AD

�R���陭
test = 16 = EFBFBD52EFBFBDEFBFBDEFBFBDE999AD
Cœur
vish4071
  • Because the string encoding Java uses is not 8 bit, it's 16 bit. Maybe Unicode or UTF; not sure which. – Jos Sep 23 '15 at 13:35
  • Also, I don't think converting byte arrays to a string is a good idea if you want to reproduce the byte array from it. You can try converting to a hex string instead. – Jos Sep 23 '15 at 13:36
  • But the reverse conversion should also be using the same encoding. I should, anyway, get the expected result. – vish4071 Sep 23 '15 at 13:36
  • You never show how you build the original array, but it sure looks like it doesn't contain valid UTF-8 bytes. – Kayaman Sep 23 '15 at 13:37
  • Actually, my use case is this: I want to convert byte[] to string and then DES encrypt it. I'm using this code to do it: http://stackoverflow.com/questions/20227/how-do-i-use-3des-encryption-decryption-in-java – vish4071 Sep 23 '15 at 13:38
  • It contains bytes all in range 0-256, @Kayaman. I'm sure there is no problem there. – vish4071 Sep 23 '15 at 13:40
  • @vish4071 This is not how it works, you FIRST convert text/String to bytes THEN you encode it. When decoding you get bytes which you convert back to text/String. – A4L Sep 23 '15 at 13:40
  • @vish4071 Then obviously you don't know UTF-8. – Kayaman Sep 23 '15 at 13:42
  • Yes, @Kayaman. I have not exactly studied `utf-8`. See the link I mentioned above. That code uses utf-8 for encoding/decoding, so I thought I'd use the same. – vish4071 Sep 23 '15 at 13:47
  • Can anyone tell me how I can DES encrypt the byte[] (i.e. message)? My use case is described in the comment above. – vish4071 Sep 23 '15 at 13:51
  • Just: byte[] encrypted = encrypt("Hallo World".getBytes("UTF-8")); I can't see any attempt to encrypt in your code, or any payload/content in the variable "message". – xoned Sep 23 '15 at 14:18
  • @Timo, I don't have string to be encrypted. I only have a byte array, `message` and I want to encrypt that. – vish4071 Sep 23 '15 at 14:20
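
As the comments above point out, a cipher works on bytes, so a byte[] can be encrypted directly with no String step in between. The following is only a minimal sketch under assumptions: the class `ByteArrayCipherDemo` and its helpers are hypothetical names, a throwaway key is generated in place of a real one, and the `DESede/CBC/PKCS5Padding` transformation is chosen for illustration rather than taken from the linked code.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

final class ByteArrayCipherDemo {

    // Assumed transformation for this sketch: 3DES in CBC mode with padding.
    private static final String TRANSFORMATION = "DESede/CBC/PKCS5Padding";

    // Generates a throwaway 3DES key; a real application would load its own key.
    static SecretKey newKey() {
        try {
            return KeyGenerator.getInstance("DESede").generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the cipher in the given mode over raw bytes; no String is involved.
    private static byte[] run(int mode, byte[] input, SecretKey key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(mode, key, new IvParameterSpec(iv));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] encrypt(byte[] plain, SecretKey key, byte[] iv) {
        return run(Cipher.ENCRYPT_MODE, plain, key, iv);
    }

    static byte[] decrypt(byte[] encrypted, SecretKey key, byte[] iv) {
        return run(Cipher.DECRYPT_MODE, encrypted, key, iv);
    }

    public static void main(String[] args) {
        // The eight bytes printed in the question's output.
        byte[] message = {(byte) 0x9A, 0x52, (byte) 0xD5, (byte) 0xD6,
                          (byte) 0xC6, (byte) 0xE9, (byte) 0x99, (byte) 0xAD};

        SecretKey key = newKey();
        byte[] iv = new byte[8]; // the DES block size is 8 bytes
        new SecureRandom().nextBytes(iv);

        byte[] encrypted = encrypt(message, key, iv);
        byte[] decrypted = decrypt(encrypted, key, iv);
        System.out.println("round trip equal: " + Arrays.equals(message, decrypted));
    }
}
```

If the encrypted bytes later need to be stored or sent as text, a binary-to-text encoding such as Base64 or hex is the safer choice than a charset decode.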

2 Answers

6

Your original byte[] contains illegal byte sequences (that is, sequences that don't form valid UTF-8 characters). The behavior of the String(byte[], String) constructor is unspecified for such input, but in your implementation each bad byte is replaced by the replacement character "�" (\uFFFD), which takes three bytes in UTF-8. You have four of these, which accounts for 12 of the 16 bytes; the remaining four are the valid byte 52 ("R") and the valid three-byte sequence E999AD.
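
The effect can be reproduced with a small sketch. The class name `RoundTripDemo` and its `roundTrip` helper are hypothetical; the input bytes are the ones printed in the question. It round-trips the array through UTF-8, which is lossy for this input, and through ISO-8859-1, which maps each byte to exactly one character and therefore gets the same bytes back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {

    // Decode the bytes with the given charset, then encode the string back.
    static byte[] roundTrip(byte[] input, Charset charset) {
        return new String(input, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // The eight bytes printed in the question's output.
        byte[] message = {(byte) 0x9A, 0x52, (byte) 0xD5, (byte) 0xD6,
                          (byte) 0xC6, (byte) 0xE9, (byte) 0x99, (byte) 0xAD};

        byte[] viaUtf8 = roundTrip(message, StandardCharsets.UTF_8);
        byte[] viaLatin1 = roundTrip(message, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip length: " + viaUtf8.length);
        System.out.println("UTF-8 round trip equal: " + Arrays.equals(message, viaUtf8));
        System.out.println("ISO-8859-1 round trip equal: " + Arrays.equals(message, viaLatin1));
    }
}
```

The ISO-8859-1 string is only a container for the bytes, not meaningful text, so it suits in-memory round trips better than display or storage.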

yshavit
  • The way I'm initializing `byte[]` is selecting a random value between 0-255 and assign it to `byte[i]`, where i goes from [0,8) – vish4071 Sep 23 '15 at 13:41
  • @vish4071 You can't assign the values randomly. UTF8 has rules you need to follow. – Kayaman Sep 23 '15 at 13:43
  • @vish4071 Well... don't do that, if you want to encode your `byte[]` as a utf-8 string. Not all byte sequences are valid in all encodings. If you're trying to "Stringify" the bytes for serialization or similar, I would suggest base64 or something like that. – yshavit Sep 23 '15 at 13:43
  • See my comment on the question for my use case. Can anyone tell me how I can DES encrypt the byte[] (i.e. message)? – vish4071 Sep 23 '15 at 13:44
  • That would be a different question. :) It looks like A4L provided you a good direction, but SO is for specific, targeted questions -- so we shouldn't let this question morph into something bigger. – yshavit Sep 23 '15 at 13:46
  • I understand that @yshavit, but at the same time, I framed a similar question before but I was told that it was too broad and could not be answered like that – vish4071 Sep 23 '15 at 13:50
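
The Base64 and hex suggestions from the comments can be sketched as follows. This is a minimal example, not a definitive implementation: the class `BytesAsText` and its `toHex`/`fromHex` helpers are hypothetical names, while `java.util.Base64` is the standard JDK class (Java 8 and later).

```java
import java.util.Arrays;
import java.util.Base64;

final class BytesAsText {

    // Two uppercase hex digits per byte, the same format printBytes uses.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Inverse of toHex; assumes an even-length string of valid hex digits.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // The eight bytes printed in the question's output.
        byte[] message = {(byte) 0x9A, 0x52, (byte) 0xD5, (byte) 0xD6,
                          (byte) 0xC6, (byte) 0xE9, (byte) 0x99, (byte) 0xAD};

        String base64 = Base64.getEncoder().encodeToString(message);
        byte[] fromBase64 = Base64.getDecoder().decode(base64);
        System.out.println("Base64 round trip equal: " + Arrays.equals(message, fromBase64));

        String hex = toHex(message);
        System.out.println("hex = " + hex);
        System.out.println("hex round trip equal: " + Arrays.equals(message, fromHex(hex)));
    }
}
```

Both encodings accept any byte sequence, which is why they round-trip where a UTF-8 decode does not.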
-1
new String(message, "utf-8");

This code tells the String object that your message is UTF-8 encoded.

test.getBytes("utf-8");

This code means: give me the bytes of the string, encoded as UTF-8. The result is that your string will be double UTF-8 encoded.

Encode only once:

String test = new String(message, "utf-8");
test.getBytes();

Sample for double encoded strings:

public class Test {

    public static void main(String[] args) {
        try {
            String message = "äöü";
            Test.printBytes("java internal encoded: = ", message.getBytes());
            Test.printBytes("utf-8 encoded: = ", message.getBytes("utf-8"));
            // get the string utf-8 encoded and create a new string with the
            // utf-8 encoded content
            message = new String(message.getBytes("utf-8"), "utf-8");
            Test.printBytes("test get bytes without charset: = ", message.getBytes());
            Test.printBytes("test get bytes with charset: = ", message.getBytes("utf-8"));
            System.out.println(message);
            System.out.println("double encoded: " + new String(message.getBytes("utf-8")));
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    public static void printBytes(String msg, byte[] b) {
        System.out.print(msg + " = ");
        for (int i = 0; i < b.length; i++) {
            System.out.print("" + String.format("%02X", b[i]));
        }
        System.out.println("\n");
    }

}

Output:

java internal encoded: =  = E4F6FC
utf-8 encoded: =  = C3A4C3B6C3BC
test get bytes without charset: =  = E4F6FC
test get bytes with charset: =  = C3A4C3B6C3BC

äöü
double encoded: Ã¤Ã¶Ã¼ <-- the java internal encoding is not converted to utf-8, it is double encoded
TwilightTitus
  • Wrong. If you don't pass the encoding as a parameter, the platform default encoding will be used. Since this can cause problems when suddenly you have different platforms with different encodings, you should **always** explicitly tell the encoding used. – Kayaman Sep 23 '15 at 13:49
  • *"your string will be double utf-8 encoded"* Sorry, but this sounds funny :P. – Tom Sep 23 '15 at 13:50
  • Look at my post and you see double encoded strings! – TwilightTitus Sep 23 '15 at 14:20
  • @TwilightTitus Your "Java internal encoded" is actually "characters encoded with the default platform encoding" (which on my platforms is UTF-8), and your "double encoded" is actually "this is what UTF-8 data looks like when you pretend it's ISO-8859-1". It's not double encoding, it's **wrong** encoding. – Kayaman Sep 24 '15 at 10:42
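
The point of the last comment can be illustrated with a short sketch that names both charsets explicitly, so the result does not depend on the platform default. The class `WrongCharsetDemo` and its `decode` helper are hypothetical; the text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {

    // Decode bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "\u00E4\u00F6\u00FC"; // the characters ä, ö, ü
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // C3 A4 C3 B6 C3 BC

        // Decoding with the charset that produced the bytes restores the text.
        String right = decode(utf8, StandardCharsets.UTF_8);

        // Decoding the same bytes as ISO-8859-1 turns each byte into its own
        // character, which is the garbled form the answer calls "double encoded".
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println("right charset restores text: " + right.equals(original));
        System.out.println("wrong charset length: " + wrong.length());
    }
}
```

Nothing is encoded twice here: the six UTF-8 bytes are simply read back as six single-byte characters.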