tl;dr: You can't decode arbitrary bytes as UTF-8, because some byte streams are not conforming UTF-8 streams. If you need to represent arbitrary bytes as a String, use something like Base64:
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
Not all byte sequences are valid UTF-8
UTF-8 has very specific rules about which byte sequences are allowed. The short version is:
- a byte in the range 0x00-0x7F can stand alone (and represents the equivalent character as its ASCII encoding).
- a byte in the range 0xC2-0xF4 is a leading byte that starts a multi-byte sequence, with the exact value indicating the number of continuation bytes.
- a byte in the range 0x80-0xBF is a continuation byte that has to come after a leading byte and possibly some other continuation bytes.
There are a few more rules and nuances to it, but that's the basic idea.
As you can see there are several byte values (0xC0, 0xC1, 0xF5-0xFF) that can't appear in a well-formed UTF-8 stream at all. Additionally some other bytes can only occur in specific sequences. For example a leading byte can never be followed by another leading byte or a stand-alone byte. Similarly a stand-alone byte must never be followed by a continuation byte.
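The simplified rules above can be sketched as a small classifier. This is only an illustration of the short version, not a full validator: the class name Utf8ByteKind and the method classify are made up for this example, and it ignores the extra rules (overlong forms, surrogates, the restricted second byte after 0xE0, 0xED, 0xF0 and 0xF4).

```java
public class Utf8ByteKind {
    /** Classifies one byte by the simplified rules only (hypothetical helper). */
    static String classify(byte b) {
        int value = b & 0xFF; // view the signed Java byte as 0..255
        if (value <= 0x7F) return "stand-alone";
        if (value <= 0xBF) return "continuation";
        if (value >= 0xC2 && value <= 0xF4) return "leading";
        return "never valid"; // 0xC0, 0xC1 and 0xF5-0xFF
    }

    public static void main(String[] args) {
        byte[] samples = { 0x41, (byte) 0xBE, (byte) 0xD9, (byte) 0xC0, (byte) 0xFF };
        for (byte b : samples) {
            System.out.printf("0x%02X -> %s%n", b & 0xFF, classify(b));
        }
    }
}
```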
Note about "negative values": byte in Java is a signed data type. But the signed/unsigned debate is not relevant for this topic, as it only matters when calculating with the value or when printing it. It's the 8-bit type to use in Java and the fact that the byte 0xBE is represented as -66 in Java is mostly a visual distinction. For the purposes of this discussion "negative values" is equivalent to "byte values between 0x80 and 0xFF". It just so happens that the non-negative values are exactly the stand-alone bytes in UTF-8 and are converted just fine.
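To make the signed/unsigned point concrete, here is a minimal sketch showing that 0xBE and -66 are the same 8 bits, just printed differently. The class and method names are placeholders for this example; masking with 0xFF is one common way to get the unsigned view (Byte.toUnsignedInt does the same).

```java
public class SignedByteDemo {
    /** Returns the 0..255 view of a signed Java byte. */
    static int unsignedValue(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xBE;
        System.out.println(b);                                      // prints -66
        System.out.println(unsignedValue(b));                       // prints 190
        System.out.println(Integer.toHexString(unsignedValue(b)));  // prints be
    }
}
```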
All this means that decoding arbitrary byte[] as UTF-8 will not work in most cases!
Then why doesn't new String(...) throw an exception?
But if arbitraryBytes holds bytes that aren't valid UTF-8, then why doesn't new String(arbitraryBytes, StandardCharsets.UTF_8) throw an exception?
Good question! Maybe it should, but the designers of Java have decided that this specific way of decoding a byte[] into a String should be lenient:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
The "default replacement string" in this case is simply the Unicode character U+FFFD Replacement Character, which looks like a question mark in a filled rhombus: �
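A tiny illustration of that lenient behaviour, using a made-up three-byte input where the middle byte (0xFF) can never appear in UTF-8. The class and method names are placeholders chosen for this sketch.

```java
import java.nio.charset.StandardCharsets;

public class LenientDecodeDemo {
    /** Decodes with the lenient String constructor; malformed input becomes U+FFFD. */
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = { 'a', (byte) 0xFF, 'b' };
        String decoded = decodeLeniently(input);
        // No exception is thrown; the invalid byte was silently replaced.
        System.out.println(decoded.equals("a\uFFFDb")); // prints true
    }
}
```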
And as the documentation states, there is of course a way to decode a byte[] to a String and get a real exception when it doesn't go right:
byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder().onMalformedInput(CodingErrorAction.REPORT);
String string = decoder.decode(ByteBuffer.wrap(arbitraryBytes)).toString();
This code will throw an exception:
Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:820)
at org.example.Main.main(Main.java:13)
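If you'd rather get a yes/no answer than a stack trace, the same strict decoder can be wrapped in a small check. This is a sketch with the needed imports spelled out; the class name StrictUtf8 and the method isWellFormedUtf8 are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    /** Returns true only if the bytes decode as UTF-8 without any error (hypothetical helper). */
    static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        System.out.println(isWellFormedUtf8(arbitraryBytes)); // prints false
        System.out.println(isWellFormedUtf8("hello".getBytes(StandardCharsets.UTF_8))); // prints true
    }
}
```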
Okay, but I really need a String!
We have realized that decoding your byte[] to a String using UTF-8 doesn't work. One could use ISO-8859-1, which maps all 256 byte values to characters, but that would result in Strings with many unprintable control characters, which would be quite cumbersome to handle.
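For illustration, here is roughly what the ISO-8859-1 route looks like with the example bytes from this answer: the round trip is lossless, but the resulting String contains control characters. The class and method names are placeholders, and Character.isISOControl is used as one possible definition of "unprintable".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    /** True if decoding and re-encoding as ISO-8859-1 returns the original bytes. */
    static boolean roundTrips(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        return Arrays.equals(bytes, s.getBytes(StandardCharsets.ISO_8859_1));
    }

    /** Counts ISO control characters (U+0000-U+001F and U+007F-U+009F) in the decoded String. */
    static long countControlChars(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.ISO_8859_1);
        return s.chars().filter(Character::isISOControl).count();
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        System.out.println(roundTrips(arbitraryBytes));        // prints true
        System.out.println(countControlChars(arbitraryBytes)); // prints 3 (0x02, 0x9C, 0x04)
    }
}
```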
Use Base64
The usual solution for this is to use Base64:
// encode byte[] to Base64
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
System.out.println(base64);
// decode Base64 to byte[]
byte[] decoded = Base64.getDecoder().decode(base64);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
With the same arbitraryBytes as before this will print:
An4hvpwE2Ww=
true
Base64 is a common choice because it is able to represent arbitrary bytes with a reasonable number of characters (on average it will take about a third more characters than it has input bytes, depending on the exact formatting and/or padding used).
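The "about a third more" figure follows from Base64 turning every 3 input bytes into 4 characters. A rough sketch of the arithmetic, with made-up helper names and ignoring any line breaks that the MIME variant would add:

```java
import java.util.Base64;

public class Base64Length {
    /** Length with '=' padding: every started group of 3 bytes becomes 4 characters. */
    static int paddedLength(int byteCount) {
        return 4 * ((byteCount + 2) / 3);
    }

    /** Length without padding: ceil(byteCount * 4 / 3). */
    static int unpaddedLength(int byteCount) {
        return (4 * byteCount + 2) / 3;
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        System.out.println(paddedLength(arbitraryBytes.length));   // prints 12
        System.out.println(unpaddedLength(arbitraryBytes.length)); // prints 11
        System.out.println(Base64.getEncoder().encodeToString(arbitraryBytes).length()); // prints 12
    }
}
```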
There are a few variations of Base64, which are used in various situations. Particularly common is the use of the URL- and filename-safe variant, which ensures that no characters with any special meaning in URLs and file names are used. Luckily it is directly supported in Java.
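As a sketch of the URL- and filename-safe variant: java.util.Base64 offers getUrlEncoder() and getUrlDecoder(), which use '-' and '_' instead of '+' and '/'. The two-byte input below is made up because it happens to hit exactly those characters; dropping the padding with withoutPadding() is an optional choice, not something this answer requires.

```java
import java.util.Arrays;
import java.util.Base64;

public class UrlSafeBase64 {
    /** Encodes with the URL-safe alphabet and no '=' padding (hypothetical helper). */
    static String encodeUrlSafe(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] sample = { (byte) 0xFB, (byte) 0xFF };
        System.out.println(Base64.getEncoder().encodeToString(sample)); // prints +/8=
        System.out.println(encodeUrlSafe(sample));                      // prints -_8
        byte[] decoded = Base64.getUrlDecoder().decode(encodeUrlSafe(sample));
        System.out.println(Arrays.equals(sample, decoded));             // prints true
    }
}
```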
Format as a hex string
Base64 is neat and useful, but it somewhat obfuscates the individual byte values. Occasionally we want a format that allows us to interpret the values in some way. For this a hexadecimal representation of the data might be more useful, even though it takes up more characters than Base64:
// encode byte[] to hex
String hexFormatted = HexFormat.of().formatHex(arbitraryBytes);
System.out.println(hexFormatted);
// decode hex to byte[]
byte[] decoded = HexFormat.of().parseHex(hexFormatted);
System.out.println(Arrays.equals(arbitraryBytes, decoded));
This will print
027e21be9c04d96c
true
This hex format (without separator) will take exactly 2 characters per input byte, making this format more verbose than Base64.
HexFormat was added in Java 17. If you're on an older version, there are plenty of other ways to do this.
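One such way for older Java versions, as a rough sketch using only Character.forDigit and Character.digit. The class LegacyHex and its two methods are invented for this example and are not part of the JDK.

```java
import java.util.Arrays;

public class LegacyHex {
    /** Formats bytes as lowercase hex, two characters per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }

    /** Parses a hex string (no separators) back into bytes. */
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        String hex = toHex(arbitraryBytes);
        System.out.println(hex);                                        // prints 027e21be9c04d96c
        System.out.println(Arrays.equals(arbitraryBytes, fromHex(hex))); // prints true
    }
}
```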
But I've already converted my byte[] to String using UTF-8 and I really need my original data back.
Sorry, but you most likely can't. Unless you were very lucky and your original byte[] happened to be a well-formed UTF-8 stream, the conversion to String will have lost some data and you will only be able to recover a fraction of your original byte[].
String badString = new String(arbitraryBytes, StandardCharsets.UTF_8);
byte[] recoveredBytes = badString.getBytes(StandardCharsets.UTF_8);
This will give you something, but wherever your input contained an encoding error, the result will contain the byte sequence 0xEF 0xBF 0xBD (or -17 -65 -67, when interpreted as signed bytes and printed in decimal). That byte sequence is how UTF-8 encodes the U+FFFD Replacement Character.
Depending on the specific input (and even the specific implementation of the UTF-8 decoder!) each replacement character can replace one or more bytes, so you can't even reliably tell the size of the original input array like this.
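A minimal sketch of why the data is gone for good: two different invalid inputs collapse into the same String, so nothing in the result tells you which one you started with. The one-byte inputs and the class name are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    /** Decodes leniently as UTF-8 and encodes the result again. */
    static byte[] roundTrip(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] first = { (byte) 0xFF };
        byte[] second = { (byte) 0xFE };
        // Both print [-17, -65, -67], the UTF-8 encoding of U+FFFD.
        System.out.println(Arrays.toString(roundTrip(first)));
        System.out.println(Arrays.toString(roundTrip(second)));
        // The originals differ, but their round-tripped forms are identical.
        System.out.println(Arrays.equals(roundTrip(first), roundTrip(second))); // prints true
    }
}
```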