You seem to think that bytes and characters are interchangeable.
They simply are not.
To turn characters into bytes, you 'encode' the characters using a 'charset encoding'. To turn bytes back into characters, you decode them using a 'charset encoding'. There is no such thing as converting one to the other without a charset encoding.
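To make that concrete, here is a minimal sketch that encodes the same string with two different charsets and then decodes with the wrong one. The class name EncodeDecodeDemo and the sample text are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodeDecodeDemo {
    // Sample text 'héllo'; the 'é' is written as an escape so the
    // source file's own encoding cannot affect the result.
    static final String TEXT = "h\u00e9llo";

    public static void main(String[] args) {
        // Always name the charset explicitly instead of relying on the platform default.
        byte[] utf8 = TEXT.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = TEXT.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);   // 6: 'é' takes two bytes in UTF-8
        System.out.println(latin1.length); // 5: 'é' takes one byte in ISO-8859-1

        // Decoding with the same charset gives the text back...
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(TEXT));      // true
        // ...decoding with a different one does not.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1).equals(TEXT)); // false
    }
}
```

Same characters, different bytes: which bytes you get depends entirely on the charset you name.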
The transition bytes->chars->bytes is only 'perfect' (guaranteed to always give you the same byte array back) for a select few encoding systems. Most encoding systems do not have this property. One encoding system that does is ISO-8859-1. However, the 2 most common encodings do not have this property: neither UTF-8 nor US-ASCII gets the job done.
The methods you use here (both str.getBytes() and new String(byteArr)) use the 'platform default encoding'. Starting with JDK 18, that default is UTF-8 unless you explicitly override it (which means this will not work properly), and before that, it's whatever your system's default encoding is, which we don't know.
US-ASCII doesn't work because US-ASCII only defines a subset of all bytes as 'valid': 0-127. Most of your bytes (all of them with a minus sign) aren't valid ASCII.
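UTF-8 doesn't work because not all byte sequences are valid UTF-8. In other words, there are sequences of bytes that no string can ever encode to in UTF-8. When new String(...) runs into one, it substitutes a replacement character, and the original bytes are gone.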
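A small sketch of that claim, assuming a made-up helper named survivesRoundTrip and a sample array that contains a few negative bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripDemo {
    // Decodes the bytes to a String with the given charset, encodes that
    // String again, and reports whether the original bytes came back.
    static boolean survivesRoundTrip(byte[] data, Charset cs) {
        byte[] back = new String(data, cs).getBytes(cs);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        byte[] data = {56, 99, -23, -52, -127}; // arbitrary bytes, not text

        System.out.println(survivesRoundTrip(data, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survivesRoundTrip(data, StandardCharsets.US_ASCII));   // false
        System.out.println(survivesRoundTrip(data, StandardCharsets.UTF_8));      // false
    }
}
```

ISO-8859-1 maps every one of the 256 byte values to exactly one character, which is why it is the only one of the three that gives the array back unchanged.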
More to the point though, the entire principle is just broken. Even if you know it's ISO-8859-1, what are you trying to accomplish by doing this? You may be able to decode an arbitrary byte array as ISO-8859-1 and encode it back again without losing anything, but what point does this serve? You can easily produce strings that cause havoc, with NUL characters, tabs, backspaces, 'bell' sounds, and other bizarreness. It's a string you'd never ever want to print. Which raises the question: Why do you want one, then?
There really is only one sensible answer to that question, and that is: I wish to transport these bytes through a medium that only supports strings. For example, I have some raw bytes, and I want to put them in an email, or in a form field for a Jira ticket or something silly like that, and an attachment is for some reason not an option in this. Or I want to stuff it into a URL (https://www.foo.bar/?q=raw-bytes-here).
There are 2 answers to doing that, and neither involves new String(byteArr):
Nibbles
Any raw byte can trivially be turned into hexadecimal representation: 255 (or -1, in signed byte form, it's the same thing) turns into FF. 1 turns into 01 - all bytes are always exactly 2 characters in length. You can use:
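byte f = -1;
// Mask with 0xFF so the byte is treated as unsigned (0-255).
// A plain (int) cast sign-extends -1 and would print 'FFFFFFFF'.
String nibbled = String.format("%02X", f & 0xFF);
System.out.println(nibbled); // prints 'FF'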
Each individual character (0-9 or A-F; technically each one is just a digit in hexadecimal, where A-F are also digits) is called a 'nibble' (because it's half a byte, see? Boy, the 60s, when these terms were invented, were a hoot, weren't they).
This is somewhat inefficient; a byte array of X bytes turns into a string of 2*X characters. That's 50% efficiency if each character takes 1 byte, and each character may well take 2 bytes, e.g. if it is UTF-16 encoded, for a total of 25% efficiency, ouch. But it is trivially readable and common. It's great for short (sub 500 or so bytes) byte arrays.
Another advantage is that you can eyeball the string and know what the data is, if you can read hexadecimal and, if signed is relevant, 2's complement, which is not too difficult.
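For a whole array rather than a single byte, something along these lines should do. It's a minimal sketch; the class HexDemo and the method names toHex and fromHex are invented for this example.

```java
final class HexDemo {
    // Turns every byte into exactly two uppercase hex digits.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02X", b & 0xFF)); // mask so negatives don't sign-extend
        }
        return sb.toString();
    }

    // Reverses toHex: every pair of hex digits becomes one byte.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {-1, 1, 127};
        String hex = toHex(data);
        System.out.println(hex); // prints 'FF017F'
        System.out.println(java.util.Arrays.equals(data, fromHex(hex))); // true
    }
}
```

If you are on Java 17 or later, java.util.HexFormat covers the same ground, so you may not need to hand-roll this at all.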
Base64
Base64 is a simple encoding scheme that defines 64 'safe' characters that you know will safely 'survive' without getting mangled or misinterpreted. That gives you 6 bits of data per character. Bytes are 8 bits, so you can 'stuff' 3 bytes into 4 characters this way; for example a 900 byte array turns into 1200 characters. If the array length isn't a multiple of 3, the output is padded with '=' up to the next multiple of 4 characters.
Java has base64 encoding/decoding built in.
import java.util.Arrays;
import java.util.Base64;

byte[] arr = new byte[] {56, 99, 87, 77, 73, 90, 105, -23, -52, -85, -9, -55, -115, 11, -127, -127};
String s = Base64.getEncoder().encodeToString(arr);
// s is all ASCII chars and safe to include just about everywhere:
// emails, web forms, you name it. For a URL parameter, use
// Base64.getUrlEncoder() instead: the standard alphabet includes '+' and '/'.
byte[] arr2 = Base64.getDecoder().decode(s);
Arrays.equals(arr, arr2); // true, guaranteed.
Base64 is slightly more complicated, and you can no longer eyeball a base64 string and just see the bytes matrix-style. But it is more efficient than nibble form: 75% efficiency (or 37.5% if the underlying characters take 2 bytes per char, i.e. with UTF-16).
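If the string is headed into a URL, the standard Base64 alphabet can bite you, because it includes '+' and '/', both of which mean something in URLs. Java ships a URL-safe variant that swaps those for '-' and '_'. A minimal sketch; the class UrlSafeBase64Demo and its two helper methods are invented for this example, and the 3 sample bytes were picked because they happen to encode to nothing but the troublesome characters.

```java
import java.util.Arrays;
import java.util.Base64;

final class UrlSafeBase64Demo {
    // URL-safe alphabet ('-' and '_' instead of '+' and '/'), with the '=' padding dropped.
    static String encodeForUrl(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    // The URL-safe decoder accepts input with or without padding.
    static byte[] decodeFromUrl(String s) {
        return Base64.getUrlDecoder().decode(s);
    }

    public static void main(String[] args) {
        byte[] data = {-5, -1, -2}; // 0xFB 0xFF 0xFE

        System.out.println(Base64.getEncoder().encodeToString(data)); // prints '+//+'
        System.out.println(encodeForUrl(data));                       // prints '-__-'
        System.out.println(Arrays.equals(data, decodeFromUrl(encodeForUrl(data)))); // true
    }
}
```

Just make sure both ends agree on the variant: the standard decoder rejects '-' and '_', and the URL-safe decoder rejects '+' and '/'.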