3
        byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
        String s= new String(arr);
        Arrays.equals(arr, s.getBytes());  // returns false

Why are the arrays not equal? I would expect getBytes() to return the original byte array.

asim
  • I suggest inspecting the arrays manually. What do you get if you print `s.getBytes()`? Is it the same as the original `arr`? – Code-Apprentice Apr 18 '22 at 15:10
  • `56 99 87 77 73 90 105 -23 -52 -85 -9 -55 -115 11 -127 -127` `56 99 87 77 73 90 105 -23 -52 -85 -9 -55 -115 11 63 63` – asim Apr 18 '22 at 15:12
  • Are negative bytes valid when using the default character encoding on your system? – Code-Apprentice Apr 18 '22 at 15:13
  • Negative values are not really relevant. Character encodings don't care. The result will depend on what character encoding is used. Running it with `-Dfile.encoding=Latin1` will 'work' – g00se Apr 18 '22 at 15:15
  • It's all about which charset the characters represented by these negative numbers belong to. If you know it, it will work – Chetan Ahirrao Apr 18 '22 at 15:22
  • As will GB18030 GBK IBM00858 IBM437 IBM775 IBM850 IBM852 IBM855 IBM857 IBM862 IBM866 ISO-8859-1 ISO-8859-13 ISO-8859-15 ISO-8859-16 ISO-8859-2 ISO-8859-3 ISO-8859-4 ISO-8859-5 ISO-8859-7 ISO-8859-9 KOI8-R KOI8-U windows-1251 x-IBM737 x-iso-8859-11 – g00se Apr 18 '22 at 15:22
  • Both `new String(bytes[])` and `getBytes()` use the `Charset.defaultCharset()` – Rob Audenaerde Apr 18 '22 at 15:24
  • @ChetanAhirrao *If you know it, it will work* Knowing *alone* won't help you. You need to set that `file.encoding` in the system or use the correct argument to `getBytes` – g00se Apr 18 '22 at 15:24
  • Btw. Thanks, this makes a great java puzzler ;) – Rob Audenaerde Apr 18 '22 at 15:34
  • Of course, starting in JDK 18 UTF-8 is going to be the [default character set](https://openjdk.java.net/jeps/400), so the results will also depend on what Java version you are using. Basically, never use the `new String(byte[])` or `getBytes()` methods; always use the overloads that let you specify the charset. – David Conrad Apr 18 '22 at 16:44

3 Answers

5

You seem to think that bytes and characters are interchangeable.

They simply are not.

To turn characters into bytes, you 'encode' the characters using a 'charset encoding'. To turn bytes back into characters, you decode them using a 'charset encoding'. There is no such thing as converting one to the other without a charset encoding.

The transition bytes->chars->bytes is only 'perfect' (guaranteed to always give you the same byte array back) for a select few encoding systems. Most encoding systems do not have this property. An encoding system that does, is ISO-8859-1. However, the 2 most common encodings do not have this property: Neither UTF-8 nor US-ASCII gets the job done.

The methods you use here (both str.getBytes() and new String(byteArr)) use the 'platform default encoding'. Starting with JDK 18, that is UTF-8 by default unless it is overridden with -Dfile.encoding (so on a stock JDK 18+ this will not work properly), and before that, it's whatever your system's default encoding is, which we don't know.

US-ASCII doesn't work because it only defines a subset of all byte values as 'valid': 0-127. Half of your bytes (all of them with a minus sign) aren't valid ASCII.

UTF-8 doesn't work because not all byte sequences are valid UTF-8. In other words, there are sequences of bytes that encoding a string as UTF-8 can never produce, and your array contains some of them.
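
To make the difference between these encodings concrete, here is a minimal sketch that runs the question's array through the bytes->chars->bytes round trip with an explicitly named charset instead of the platform default. The class and method names (RoundTripDemo, survivesRoundTrip) are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decodes the bytes with the given charset, encodes the resulting string
    // with the same charset, and reports whether the original bytes came back.
    static boolean survivesRoundTrip(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return Arrays.equals(bytes, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
        System.out.println(survivesRoundTrip(arr, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survivesRoundTrip(arr, StandardCharsets.UTF_8));      // false
        System.out.println(survivesRoundTrip(arr, StandardCharsets.US_ASCII));   // false
    }
}
```

Passing the charset explicitly also makes the result independent of the JDK version and of any -Dfile.encoding setting.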

More to the point though, the entire principle is just broken. Even if you know it's ISO-8859-1, what are you trying to accomplish by doing this? You may be able to translate an arbitrary byte array into ISO-8859-1 and back again without losing anything, but what point does this serve? You can easily produce strings that cause havoc, with NUL characters, tabs, backspaces, 'bell' sounds, and other bizarreness. It's a string you'd never ever want to print. Which raises the question: Why do you want one, then?

There really is only one sensible answer to that question, and that is: I wish to transport these bytes through a medium that only supports strings. For example, I have some raw bytes, and I want to put them in an email, or in a form field for a jira ticket or something silly like that, and an attachment is for some reason not an option in this. Or I want to stuff it into a URL (https://www.foo.bar/?q=raw-bytes-here).

There are 2 ways of doing that, and neither involves new String(byteArr):

Nibbles

Any raw byte can trivially be turned into hexadecimal representation: 255 (or -1, in signed byte form, it's the same thing) turns into FF. 1 turns into 01 - all bytes are always exactly 2 characters in length. You can use:

byte f = -1;
String nibbled = String.format("%02X", f & 0xFF); // mask to 0..255; a plain (int) cast would sign-extend and print 'FFFFFFFF'
System.out.println(nibbled); // prints 'FF'

Each individual letter or digit (0-9 and A-F; in hexadecimal, A-F are digits too) is called a 'nibble', because it represents half a byte. Boy, the 60s, when these terms were invented, were a hoot, weren't they?

This is somewhat inefficient; a byte array of X bytes turns into a string of 2*X characters. That is 50% efficiency if each character takes 1 byte, and only 25% if each character takes 2 bytes (e.g. if it is UTF-16 encoded), ouch. But it is trivially readable and common. It's great for short (sub 500 or so bytes) byte arrays.

Another advantage is that you can eyeball the string and know what the data is, if you can read hexadecimal and, if signed is relevant, 2's complement, which is not too difficult.
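
As a sketch of how the nibble approach extends from a single byte to a whole array and back, something like the following would do. The HexDemo class and its method names are hypothetical helpers written for this answer, not a library API.

```java
class HexDemo {
    // Turns every byte into exactly two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask to 0..255 so negative bytes do not sign-extend to 8 hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Reverses toHex: every pair of hex digits becomes one byte.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at pair " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] arr = {56, -1, 11};
        String hex = toHex(arr);
        System.out.println(hex); // prints 38FF0B
        System.out.println(java.util.Arrays.equals(arr, fromHex(hex))); // prints true
    }
}
```

On Java 17 and later, java.util.HexFormat offers the same conversion out of the box, so the hand-written loop is mainly useful on older JDKs.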

Base64

Base64 is a simple encoding scheme that defines 64 'safe' characters that you know will safely 'survive' without getting mangled or misinterpreted. That gives you 6 bits of data per character. A byte is 8 bits, so you can 'stuff' 3 bytes (24 bits) into 4 characters this way; for example a 900 byte array turns into 1200 characters.

Java has base64 encoding/decoding built in.

byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
String s = Base64.getEncoder().encodeToString(arr);
// s is all ASCII chars and safe to include just about everywhere.
// URL parameter, emails, web forms, you name it.
byte[] arr2 = Base64.getDecoder().decode(s);
Arrays.equals(arr, arr2); // true, guaranteed.

Base64 is slightly more complicated, and you can no longer eyeball a base64 string and just see the bytes matrix-style. But it is more efficient than nibble form: 75% efficiency (or 37.5% if the underlying characters take 2 bytes per char, i.e. with UTF-16).
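
For the URL case mentioned earlier, the standard Base64 alphabet is not ideal, because '+', '/' and '=' have their own meaning in URLs. A possible variation, sketched here with made-up helper names (UrlSafeBase64Demo, encodeForUrl, decodeFromUrl), is the URL-safe encoder that java.util.Base64 also provides:

```java
import java.util.Arrays;
import java.util.Base64;

class UrlSafeBase64Demo {
    // Encodes with the URL-safe alphabet ('-' and '_' instead of '+' and '/')
    // and drops the trailing '=' padding.
    static String encodeForUrl(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    // Decodes text produced by encodeForUrl; the decoder does not require padding.
    static byte[] decodeFromUrl(String text) {
        return Base64.getUrlDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] arr = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
        String s = encodeForUrl(arr);
        System.out.println(s.length());                           // 22 characters for 16 bytes
        System.out.println(Arrays.equals(arr, decodeFromUrl(s))); // true
    }
}
```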

Alexander Ivanchenko
rzwitserloot
3

It depends on your Charset.defaultCharset(). That determines how the bytes are interpreted. Probably the negative values are a non-canonical way of representing codepoints.

(see this great answer: https://stackoverflow.com/a/7934397/461499)

Decoding the result of getBytes() into a String again then yields the canonical form, and that second round trip does return true:

    System.out.println(Charset.defaultCharset()); //UTF-8 here :)

    byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
    String s = new String(arr);
    System.out.println(s);
    System.out.println(Arrays.toString(s.getBytes()));
    // [56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67]
    System.out.println(Arrays.equals(arr, s.getBytes()));  // prints false

    // arr2 holds exactly the bytes that s.getBytes() printed above
    byte arr2[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -17, -65, -67, -52, -85, -17, -65, -67, -55, -115, 11, -17, -65, -67, -17, -65, -67};
    String s2 = new String(arr2);
    System.out.println(Arrays.toString(s2.getBytes()));
    System.out.println(Arrays.equals(arr2, s2.getBytes()));  // prints true
Rob Audenaerde
  • *Probably the negative values are a non-canonical way of representing codepoints.* Well as I mentioned above, negativity of values is not really relevant as such. An eight-bit character encoding is by definition going to have negative values as `byte` is signed in Java – g00se Apr 18 '22 at 15:36
  • That is true, it does not have to do with them being negative *directly*. However, in UTF-8, these specific values (with the bits starting with 1...) have specific meaning. – Rob Audenaerde Apr 18 '22 at 15:38
  • @RobAudenaerde the reason why the later comparison prints true is that you have manually replaced the invalid UTF-8 bytes or byte sequences with the question mark (replacement) character, which is a valid UTF-8 character and is represented by the byte sequence `-17, -65, -67` – Panagiotis Bougioukos Apr 18 '22 at 16:38
  • I did not do anything with String-output, I just fiddled the byte array. The non-existing input is converted by the `Charset` to a valid sequence. – Rob Audenaerde Apr 19 '22 at 07:33
1

The following constructor reads the byte array and decodes it according to the default charset.

new String(arr);

So when you do

String s = new String(arr);
s.getBytes();

getBytes() encodes the previously decoded string back into a byte array, again according to the default charset.

If you step through with a debugger, you can see how new String(byte[]) works when the default charset is UTF-8. The byte {-127} is not valid UTF-8 on its own, so it is decoded into the Unicode replacement character �, which getBytes() then encodes as {-17, -65, -67}.

In fact, when UTF-8 is the default charset, any byte or sequence of bytes in the array that cannot be decoded as a valid UTF-8 character is converted into �, whose byte representation is {-17, -65, -67}.
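
The replacement behaviour described above can be checked in isolation. The following is a small sketch with UTF-8 named explicitly, so it does not depend on the default charset; the class and method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {
    // UTF-8 encoding of U+FFFD, the Unicode replacement character.
    static byte[] replacementBytes() {
        return "\uFFFD".getBytes(StandardCharsets.UTF_8);
    }

    // Decodes a single byte as UTF-8; an invalid byte comes back as U+FFFD.
    static String decodeLoneByte(byte b) {
        return new String(new byte[] {b}, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replacementBytes()));              // [-17, -65, -67]
        System.out.println(decodeLoneByte((byte) -127).equals("\uFFFD"));     // true
    }
}
```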

In your example the bytes -23, -9 and -127 do not form valid UTF-8 where they appear. Each such byte (-127 occurs twice) is decoded into �, and each � is then encoded back as {-17, -65, -67}.

So removing the invalid bytes -9, -127 and -23 from your example makes the comparison return true when the default charset is UTF-8, because all of the remaining bytes can be decoded as UTF-8:

        byte arr[] = new byte[] {56, 99, 87, 77, 73, 90, 105, -52, -85, -55, -115, 11};
        String s= new String(arr);
        System.out.println(Arrays.equals(arr, s.getBytes())); //prints true

This shows that when you create a String from a byte array, the bytes are decoded into characters according to the charset, and getBytes() then encodes those characters into a new byte array. So you can't expect string.getBytes() to return the original byte array if some of the provided bytes are not valid in that charset.
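
If you would rather find out up front that a byte array is not valid in a charset, instead of silently getting replacement characters, one option is a CharsetDecoder configured to report errors. This is only a sketch; StrictDecodeDemo and isValidUtf8 are hypothetical names chosen for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Returns true only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // throw instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] original = {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
        byte[] trimmed = {56, 99, 87, 77, 73, 90, 105, -52, -85, -55, -115, 11};
        System.out.println(isValidUtf8(original)); // false
        System.out.println(isValidUtf8(trimmed));  // true
    }
}
```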

So in the end we can sum up into the following:

Your code will return true as long as every element or sequence of elements in your byte array can be decoded by the charset that the JVM uses when it executes your code. If any element or sequence of elements is unknown to that charset, it is decoded into a replacement character, which is later encoded as a different byte or sequence of bytes.

Panagiotis Bougioukos