Consider this byte array: byte[] by = {2, 126, 33, -66, -100, 4, -39, 108};

If we execute the code below and print the result,

String utf8_str = new String(by, StandardCharsets.UTF_8);
System.out.println(utf8_str);

the output is:

~!���l

All the negative values are converted to '�', which suggests that a byte with a negative value is not in the UTF-8 character set. But the UTF-8 character set has a range of 0 to 255.

If the byte data type can only hold 0 to 127 as positive values, then values greater than 127 can never be used when encoding to UTF-8, because Java does not support unsigned bytes.

Any solution for this?

I need to encode a byte array to a UTF-8 String and get the byte array back from that String.

All the characters are encoded and retrieved properly except '�'.

When I try to retrieve '�' (i.e., print its Unicode code point), I get a different code point rather than that of the originally encoded character.

Joachim Sauer
  • 2
    `-66` or `0xBE` is not valid as first byte in UTF-8 (UTF-8 is NOT a character set, but an encoding standard) – user16320675 Feb 17 '23 at 08:17
  • 1
    BTW negative values are not the problem (e.g. `byte[] { -61, -97 }` will be converted to `"ß"`) and it is Java that interprets bytes greater than 127 as negative – user16320675 Feb 17 '23 at 08:39
  • 1
    You are confused. It seems like your real problem is about "I needed to encode a byte array to UTF8 character String and get the byte array back from the UTF8 character String." Can you show a [mcve] of that problem instead? I can assure you this is not about negatives or positives. – Sweeper Feb 17 '23 at 08:40

1 Answer

tl;dr: You can't decode arbitrary bytes as UTF-8, because not every byte sequence is well-formed UTF-8. If you need to represent arbitrary bytes as a String, use something like Base64:

String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);

Not all byte sequences are valid UTF-8

UTF-8 has very specific rules about which byte sequences are allowed. The short version is:

  • a byte in the range 0x00-0x7F can stand alone (and represents the same character as in ASCII).
  • a byte in the range 0xC2-0xF4 is a leading byte that starts a multi-byte sequence, with the exact value indicating the number of continuation bytes (0xC2-0xDF: one, 0xE0-0xEF: two, 0xF0-0xF4: three).
  • a byte in the range 0x80-0xBF is a continuation byte that has to come after a leading byte and possibly some other continuation bytes.

There are a few more rules and nuances to it, but that's the basic idea.

As you can see, there are several byte values (0xC0, 0xC1, 0xF5-0xFF) that can't appear in a well-formed UTF-8 stream at all. Additionally, some other bytes can only occur in specific sequences. For example, a leading byte can never be followed by another leading byte or a stand-alone byte. Similarly, a stand-alone byte must never be followed by a continuation byte.
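
To make the three ranges concrete, here is a minimal sketch that labels each byte of the question's array by the role it could play in UTF-8. The class name Utf8ByteKinds and the method classify are made up for this illustration, and the check only looks at single bytes; it deliberately ignores the "few more rules" about sequences.

```java
class Utf8ByteKinds {
    /** Roughly classifies a single byte by the role it can play in UTF-8. */
    static String classify(byte b) {
        int u = b & 0xFF; // view the signed Java byte as a value in 0..255
        if (u <= 0x7F) return "stand-alone";
        if (u <= 0xBF) return "continuation";
        if (u >= 0xC2 && u <= 0xF4) return "leading";
        return "never valid"; // 0xC0, 0xC1 and 0xF5-0xFF
    }

    public static void main(String[] args) {
        byte[] by = { 2, 126, 33, -66, -100, 4, -39, 108 };
        for (byte b : by) {
            // signed decimal value, unsigned hex value, role
            System.out.printf("%4d  0x%02X  %s%n", b, b & 0xFF, classify(b));
        }
    }
}
```

In that array, -66 and -100 are continuation bytes without a preceding leading byte, and -39 is a leading byte followed by a stand-alone byte, which is why none of the three decodes.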

Note about "negative values": byte in Java is a signed data type. But the signed/unsigned debate is not relevant for this topic, as it only matters when calculating with the value or when printing it. byte is the 8-bit type to use in Java, and the fact that the byte 0xBE is represented as -66 is mostly a visual distinction. For the purposes of this discussion, "negative values" is equivalent to "byte values between 0x80 and 0xFF". It just so happens that the non-negative values are exactly the stand-alone bytes in UTF-8 and are converted just fine.
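
A small sketch of that "visual distinction": the same 8 bits print as -66 or as 190 (0xBE) depending only on how they are widened to int. The class name SignedByteDemo and the helper unsigned are hypothetical names for this example.

```java
class SignedByteDemo {
    /** Returns the unsigned value (0..255) of a Java byte. */
    static int unsigned(byte b) {
        return b & 0xFF; // equivalent to Byte.toUnsignedInt(b)
    }

    public static void main(String[] args) {
        byte b = (byte) 0xBE;
        System.out.println(b);                                 // prints -66
        System.out.println(unsigned(b));                       // prints 190
        System.out.println(Integer.toHexString(unsigned(b)));  // prints be
    }
}
```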

All this means that decoding arbitrary byte[] as UTF-8 will not work in most cases!

Then why doesn't new String(...) throw an exception?

But if arbitraryBytes contains a byte[] that isn't valid UTF-8, then why doesn't new String(arbitraryBytes, StandardCharsets.UTF_8) throw an exception?

Good question! Maybe it should, but the designers of Java have decided that this specific way of decoding a byte[] into a String should be lenient:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.

The "default replacement string" in this case is simply the Unicode character U+FFFD Replacement Character, which looks like a question mark in a filled rhombus: �

And as the documentation states, there is of course a way to decode a byte[] to a String and get a real exception when it doesn't go right:

byte[] arbitraryBytes = new byte[] { 2, 126, 33, -66, -100, 4, -39, 108 };
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder().onMalformedInput(CodingErrorAction.REPORT);
String string = decoder.decode(ByteBuffer.wrap(arbitraryBytes)).toString();

This code will throw an exception:

Exception in thread "main" java.nio.charset.MalformedInputException: Input length = 1
    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
    at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:820)
    at org.example.Main.main(Main.java:13)
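
If you would rather test for validity than deal with the exception at the call site, the strict decoder can be wrapped in a small helper. This is only a sketch: the class StrictUtf8 and the method isValidUtf8 are names invented here, and it also sets onUnmappableCharacter to REPORT for symmetry.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    /** Returns true if the bytes decode as well-formed UTF-8, false otherwise. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        byte[] sharpS = { -61, -97 }; // 0xC3 0x9F, the UTF-8 encoding of "ß"
        System.out.println(isValidUtf8(arbitraryBytes)); // prints false
        System.out.println(isValidUtf8(sharpS));         // prints true
    }
}
```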

Okay, but I really need a String!

We have realized that decoding your byte[] to a String using UTF-8 doesn't work. One could use ISO-8859-1, which maps all 256 byte values to characters, but that would result in Strings with many unprintable control characters, which would be quite cumbersome to handle.
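
For illustration, here is what the ISO-8859-1 route looks like. The class Latin1RoundTrip and its two helpers are hypothetical names. The round trip only holds as long as nothing in between re-encodes, normalizes or strips the String, which is a big assumption for a value full of control characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    /** Maps each byte to the char with the same numeric value (U+0000..U+00FF). */
    static String toLatin1String(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Reverses toLatin1String for Strings that only contain chars up to U+00FF. */
    static byte[] fromLatin1String(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        String s = toLatin1String(arbitraryBytes);
        System.out.println(s.length());                                         // prints 8
        System.out.println(Arrays.equals(arbitraryBytes, fromLatin1String(s))); // prints true
    }
}
```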

Use Base64

The usual solution for this is to use Base64:

// encode byte[] to Base64
String base64 = Base64.getEncoder().encodeToString(arbitraryBytes);
System.out.println(base64);
// decode Base64 to byte[]
byte[] decoded = Base64.getDecoder().decode(base64);
System.out.println(Arrays.equals(arbitraryBytes, decoded));

With the same arbitraryBytes as before this will print

An4hvpwE2Ww=
true

Base64 is a common choice because it is able to represent arbitrary bytes with a reasonable number of characters (on average the output has about a third more characters than there are input bytes, depending on the exact formatting and/or padding used).

There are a few variations of Base64, which are used in various situations. Particularly common is the use of the URL- and filename-safe variant, which ensures that no characters with any special meaning in URLs and file names are used. Luckily it is directly supported in Java.
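
A short sketch of the URL- and filename-safe variant, using Base64.getUrlEncoder() and Base64.getUrlDecoder() from java.util. The class UrlSafeBase64 and its helpers are names made up here, and dropping the padding with withoutPadding() is an optional choice, not a requirement. The two input bytes are chosen so that the standard alphabet would produce '+' and '/'.

```java
import java.util.Arrays;
import java.util.Base64;

class UrlSafeBase64 {
    /** Encodes with the URL-safe alphabet ('-' and '_') and without '=' padding. */
    static String encodeUrlSafe(byte[] bytes) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    /** Decodes URL-safe Base64; the padding is accepted but not required. */
    static byte[] decodeUrlSafe(String text) {
        return Base64.getUrlDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xFB, (byte) 0xFF };
        System.out.println(Base64.getEncoder().encodeToString(bytes)); // prints +/8=
        String urlSafe = encodeUrlSafe(bytes);
        System.out.println(urlSafe);                                   // prints -_8
        System.out.println(Arrays.equals(bytes, decodeUrlSafe(urlSafe))); // prints true
    }
}
```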

Format as a hex string

Base64 is neat and useful, but it somewhat obfuscates the individual byte values. Occasionally we want a format that allows us to interpret the values in some way. For this a hexadecimal representation of the data might be more useful, even though it takes up more characters than Base64:

// encode byte[] to hex
String hexFormatted = HexFormat.of().formatHex(arbitraryBytes);
System.out.println(hexFormatted);
// decode hex to byte[]
byte[] decoded = HexFormat.of().parseHex(hexFormatted);
System.out.println(Arrays.equals(arbitraryBytes, decoded));

This will print

027e21be9c04d96c
true

This hex format (without separator) will take exactly 2 characters per input byte, making this format more verbose than Base64.

If you're not yet on Java 17 or later, there are plenty of other ways to do this.
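
One of those other ways, as a minimal sketch that needs nothing beyond java.lang: a hand-written hex codec built on Character.forDigit and Character.digit. The class HexCodec and its methods are hypothetical names, not an existing API.

```java
class HexCodec {
    /** Formats each byte as two lowercase hex digits. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }

    /** Parses a hex string (upper or lower case) back into bytes. */
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string has odd length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at byte " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        String hex = toHex(arbitraryBytes);
        System.out.println(hex); // prints 027e21be9c04d96c
        System.out.println(java.util.Arrays.equals(arbitraryBytes, fromHex(hex))); // prints true
    }
}
```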

But I've already converted my byte[] to String using UTF-8 and I really need my original data back.

Sorry, but you most likely can't. Unless you were very lucky and your original byte[] happened to be a well-formed UTF-8 stream, the conversion to String will have lost some data and you will only be able to recover a fraction of your original byte[].

String badString = new String(arbitraryBytes, StandardCharsets.UTF_8);
byte[] recoveredBytes = badString.getBytes(StandardCharsets.UTF_8);

This will give you something, but wherever your input contained an encoding error, the result will contain the byte sequence 0xEF 0xBF 0xBD (or -17 -65 -67 when interpreted as signed bytes and printed in decimal). That byte sequence is the UTF-8 encoding of U+FFFD Replacement Character.

Depending on the specific input (and even the specific implementation of the UTF-8 decoder!) each replacement character can replace one or more bytes, so you can't even reliably tell the size of the original input array like this.
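
To see the loss for yourself, a sketch along these lines compares the original array with what comes back. The class LossyRoundTrip and the method roundTripViaUtf8 are names invented for this example. No exact output is shown for the broken case, since the number of replacement characters can depend on the decoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyRoundTrip {
    /** Decodes leniently as UTF-8 and encodes the resulting String again. */
    static byte[] roundTripViaUtf8(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] arbitraryBytes = { 2, 126, 33, -66, -100, 4, -39, 108 };
        byte[] recoveredBytes = roundTripViaUtf8(arbitraryBytes);

        System.out.println(Arrays.toString(arbitraryBytes));
        System.out.println(Arrays.toString(recoveredBytes)); // contains -17, -65, -67 runs
        System.out.println(Arrays.equals(arbitraryBytes, recoveredBytes)); // prints false

        // Well-formed input survives: 0xC3 0x9F is "ß"
        byte[] valid = { -61, -97 };
        System.out.println(Arrays.equals(valid, roundTripViaUtf8(valid))); // prints true
    }
}
```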

Joachim Sauer