No sense length() result

Question

Since today I've been facing a really weird problem with byte[] to String conversion.

Here is the code:

private static final byte[] test_key = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38, 118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};

public static void main(String[] args) {
    byte[] encryptedArray = xor("ciao".getBytes(), test_key);

    System.out.println("Encrypted arrray: " + Arrays.toString(encryptedArray));
    final String encrypted = new String(encryptedArray);

    System.out.println("Length: " + new String(encryptedArray).length());
    System.out.println(Arrays.toString(encrypted.getBytes()));

    System.out.println("Encrypted value: " + encrypted);
    System.out.println("Decrypted value: " + new String(xor(encrypted.getBytes(), test_key)));
}

private static byte[] xor(byte[] data, byte[] key) {
    byte[] result = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        result[i] = (byte) (data[i] ^ key[i % key.length]);
    }
    return result;
}

My output is:

Encrypted arrray: [-13, -82, -78, 18]
Length: 2
[-17, -65, -67, 18]
Encrypted value: �
Decrypted value: xno

Why does length() return 2? What am I missing?

When converting your ciphertext to a string and back to a `byte[]` you apply a charset encoding. Since you don't define a special encoding (which is bad btw, an encoding should always be specified!), the default encoding is used, which obviously corrupts the ciphertext. I can reproduce your result with UTF-8 encoding. For the conversion of arbitrary binary data (like a ciphertext) a binary-to-text encoding like Base64 must be applied. Alternatively, use the binary data, i.e. directly decrypt `encryptedArray` instead of `encrypted.getBytes()`. — Topaco, Jul 24 '21 at 12:02
Is there no way to get the correct output without using Base64 and without decrypting `encryptedArray` directly? — Princekin, Jul 24 '21 at 12:19
You can also apply a charset encoding with a 1:1 mapping between bytes and characters, e.g. ISO-8859-1. Have a look [here](https://stackoverflow.com/a/9098905). For this you must either set the default encoding accordingly or specify the encoding explicitly for each encoding (`getBytes()`) / decoding (`new String()`). But this is imo more a workaround than a long term solution. — Topaco, Jul 24 '21 at 12:33
Java is not C. Don’t use a String to hold a sequence of arbitrary bytes. That’s what a byte array is for. — VGR, Jul 24 '21 at 13:54
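
A minimal sketch of what the comments above suggest, keeping the ciphertext in a byte array from start to finish. The class name `DirectXorDemo` and the helper `roundTrip` are made up for illustration; the key and the `xor` method are the ones from the question, and UTF-8 is assumed for the plaintext.

```java
import java.nio.charset.StandardCharsets;

class DirectXorDemo {
    // Same key as in the question.
    private static final byte[] TEST_KEY = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38,
            118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};

    // Same XOR routine as in the question.
    static byte[] xor(byte[] data, byte[] key) {
        byte[] result = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            result[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return result;
    }

    // Hypothetical helper: encrypts and decrypts without ever turning the ciphertext into a String.
    static String roundTrip(String plaintext) {
        byte[] encrypted = xor(plaintext.getBytes(StandardCharsets.UTF_8), TEST_KEY);
        byte[] decrypted = xor(encrypted, TEST_KEY); // operate on the raw bytes, not on encrypted text
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("Decrypted value: " + roundTrip("ciao")); // prints "Decrypted value: ciao"
    }
}
```

Only the plaintext is ever converted to or from a String here, and both conversions name their charset explicitly.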

score 4 · Accepted Answer · answered Jul 24 '21 at 12:27

There is no fixed 1-to-1 mapping between bytes and chars; the mapping depends on the charset you use. A String is logically a sequence of chars, so converting between chars and bytes requires a character encoding, which specifies the mapping from chars to bytes and vice versa. In your code, new String(encryptedArray) decodes the bytes in encryptedArray with the platform default charset, UTF-8 in your case, and those bytes do not form a valid UTF-8 sequence.
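
To make the charset dependence concrete, here is a small sketch that decodes the four ciphertext bytes from the question with two explicitly named charsets. The class and method names are invented for this illustration; ISO-8859-1 is the single-byte charset mentioned in the comments.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetLengthDemo {
    // The ciphertext bytes printed in the question.
    static final byte[] ENCRYPTED = {-13, -82, -78, 18};

    // Length of the String obtained by decoding the bytes with the given charset.
    static int decodedLength(Charset charset) {
        return new String(ENCRYPTED, charset).length();
    }

    // Decodes and re-encodes with the same charset.
    static byte[] roundTrip(Charset charset) {
        return new String(ENCRYPTED, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        System.out.println(decodedLength(StandardCharsets.UTF_8));                  // 2
        System.out.println(decodedLength(StandardCharsets.ISO_8859_1));             // 4
        System.out.println(Arrays.toString(roundTrip(StandardCharsets.UTF_8)));      // [-17, -65, -67, 18]
        System.out.println(Arrays.toString(roundTrip(StandardCharsets.ISO_8859_1))); // [-13, -82, -78, 18]
    }
}
```

With UTF-8 the original bytes are lost after the round trip; with ISO-8859-1 every byte maps to exactly one char, so they survive.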

If you want to store the data in a String and get the exact bytes back, Base64-encode encryptedArray and build the String from the result:

String encoded = new String(Base64.getEncoder().encode(encryptedArray));

To retrieve the original bytes, just decode:

Base64.getDecoder().decode(encoded);
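
Put together, the round trip might look like the sketch below. The class and method names are hypothetical. It uses `Base64.Encoder.encodeToString`, a variation on the `new String(...)` call above that avoids involving the default charset.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTripDemo {
    // Bytes -> text that is safe to store in a String.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Text -> the exact original bytes.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18}; // ciphertext bytes from the question
        String encoded = toText(encrypted);
        byte[] restored = fromText(encoded);
        System.out.println(Arrays.equals(encrypted, restored)); // true
    }
}
```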

Maarten Bodewes · Answer 2 · 2021-07-24T15:04:34.527

I just thought of a good way of showing what happens: simply replace the new String(byte[]) constructor with a method of our own, which is why I am answering the question. The method performs the same basic action as the constructor, with one change: it throws an exception if any invalid input is found.

private static final byte[] test_key = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38, 118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};

public static void main(String[] args) throws Exception {
    byte[] encryptedArray = xor("ciao".getBytes(), test_key);

    System.out.println("Encrypted arrray: " + Arrays.toString(encryptedArray));
    final String encrypted = new String(encryptedArray);

    // original
    System.out.println("Length: " + new String(encryptedArray).length());
    
    // replacement
    System.out.println("Length: " + decode(encryptedArray).length());
    
    
    System.out.println(Arrays.toString(encrypted.getBytes()));

    System.out.println("Encrypted value: " + encrypted);
    System.out.println("Decrypted value: " + new String(xor(encrypted.getBytes(), test_key)));
}

private static String decode(byte[] encryptedArray) throws CharacterCodingException {
    var decoder = Charset.defaultCharset().newDecoder();
    decoder.onMalformedInput(CodingErrorAction.REPORT);
    var decoded = decoder.decode(ByteBuffer.wrap(encryptedArray));
    return decoded.toString();
}

private static byte[] xor(byte[] data, byte[] key) {
    byte[] result = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        result[i] = (byte) (data[i] ^ key[i % key.length]);
    }
    return result;
}

The method is called decode because that is what you are actually doing: you are decoding bytes to text. A character encoding describes how characters are encoded as bytes, so the opposite direction must be decoding.

As you will see, the original line first prints 2 if your platform's default encoding is UTF-8 (Linux, Android, macOS). On Windows, which uses the Windows-1252 charset instead (a single-byte encoding which is an expansion of Latin-1, which itself is an expansion of ASCII), you can get the same result by replacing Charset.defaultCharset() with StandardCharsets.UTF_8. The decode method, however, throws the following exception:

java.nio.charset.MalformedInputException: Input length = 3
    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
    at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:815)
    at StackExchange/com.stackexchange.so.ShowBadEncoding.decode(ShowBadEncoding.java:36)
    at StackExchange/com.stackexchange.so.ShowBadEncoding.main(ShowBadEncoding.java:24)

Now maybe you'd expect 4 here, the size of the byte array. But note that in UTF-8 a single character may be encoded over multiple bytes. The error does not concern the entire input, only the character the decoder was trying to read: the first byte announces a longer sequence, and the bytes that follow do not complete it, so those 3 bytes are reported as malformed.
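
The following sketch shows why the decoder gives up where it does. It is a simplified illustration, not the JDK's decoder: `expectedLength` and `isContinuation` are hypothetical helpers that only look at the bit patterns defined by UTF-8.

```java
class Utf8SequenceDemo {
    // Number of bytes a UTF-8 sequence should have according to its lead byte,
    // or -1 if the byte cannot start a sequence.
    static int expectedLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;                 // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2;   // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3;   // 1110xxxx
        if (b >= 0xF0 && b <= 0xF4) return 4;   // 11110xxx
        return -1;                              // continuation byte or invalid lead byte
    }

    // Continuation bytes have the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18}; // 0xF3 0xAE 0xB2 0x12
        System.out.println("Expected sequence length: " + expectedLength(encrypted[0])); // 4
        for (int i = 1; i < encrypted.length; i++) {
            System.out.println("Byte " + i + " is a continuation byte: " + isContinuation(encrypted[i]));
        }
        // Bytes 1 and 2 are continuation bytes, byte 3 (0x12) is not,
        // so the four-byte sequence announced by 0xF3 is never completed.
    }
}
```

The first three bytes therefore form the malformed input of length 3, and the remaining 0x12 is an ordinary single-byte character.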

If you replace REPORT with REPLACE (heh), which is how the String constructor handles malformed input, you will see that the result is identical to the constructor, and length() will now return the value 2 again.
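
A short sketch of that comparison, assuming UTF-8 is named explicitly rather than taken from the platform default. The class name and the helper `decodeReplacing` are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class ReplaceActionDemo {
    // Decodes as UTF-8, substituting U+FFFD for malformed input instead of throwing.
    static String decodeReplacing(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE; rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18};
        String viaDecoder = decodeReplacing(encrypted);
        String viaConstructor = new String(encrypted, StandardCharsets.UTF_8);
        System.out.println(viaDecoder.length());               // 2
        System.out.println(viaDecoder.equals(viaConstructor)); // true
    }
}
```

The first char is the replacement character U+FFFD, which stands in for the three malformed bytes; the second is the char for the byte 0x12.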

Of course, Topaco is correct when he says you need to use Base64 encoding. This encodes bytes as characters instead, so that every byte value is preserved; the reverse is then the decoding of that text back to bytes.

score 1 · Answer 3 · answered Jul 24 '21 at 12:31

The elements of a String are not bytes, they are chars. A char is not a byte.

There are many ways of converting a char to a sequence of bytes (i.e., many character-set encodings).

Not every sequence of chars can be converted to a sequence of bytes; there is not always a mapping for every char. It depends on your chosen character-set encoding.

Not every sequence of bytes can be converted to a String; the bytes have to be syntactically valid for the specified character set.
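
Both directions can be checked in code. The sketch below is only an illustration with invented names (`MappingDemo`, `canEncode`, `canDecode`); it relies on `CharsetEncoder.canEncode` and on a `CharsetDecoder` configured to report errors.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MappingDemo {
    // True if every char of the text has a byte representation in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // True if the bytes are syntactically valid for the charset.
    static boolean canDecode(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18};
        System.out.println(canEncode("ciao", StandardCharsets.US_ASCII));        // true
        System.out.println(canEncode("citt\u00E0", StandardCharsets.US_ASCII));  // false
        System.out.println(canDecode(encrypted, StandardCharsets.UTF_8));        // false
        System.out.println(canDecode(encrypted, StandardCharsets.ISO_8859_1));   // true
    }
}
```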
