
I currently need to work with the bytes of a String in Java, which has raised many questions about encodings and JVM implementation details. I would like to know whether what I'm doing makes sense or whether it is redundant.

To begin with, I understand that at runtime a Java char in a String will always represent a symbol in Unicode.

Secondly, the UTF-8 encoding is always able to successfully encode any symbol in Unicode. In turn, the following snippet will always return a byte[] without doing any replacement. getBytes documentation is here.

byte[] stringBytes = myString.getBytes(StandardCharsets.UTF_8);

Then, if stringBytes is used in a different JVM instance in the following way, it will always yield a string equivalent to myString.

new String(stringBytes, StandardCharsets.UTF_8);

Do you think that my understanding of getBytes is correct? If that is the case, how would you justify it? Am I missing a pathological case which could lead me not to get an equivalent version of myString?

Thanks in advance.


EDIT:

Would you agree that, by doing the following, any non-exceptional flow leads to a handled case, which allows us to successfully reconstruct the string?


EDIT:

Based on the answers, here is a solution that lets you safely reconstruct strings whenever no exception is thrown. You still need to handle the exception somehow.

First, get the bytes using the encoder:

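import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final CharsetEncoder encoder =
    StandardCharsets.UTF_8
        .newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .onMalformedInput(CodingErrorAction.REPORT);

// Throws a CharacterCodingException if the input is malformed or contains an unmappable character.
// ByteBuffer.array() returns the whole backing array, which can be longer than the encoded data,
// so copy only the bytes that were actually written.
ByteBuffer encoded = encoder.encode(CharBuffer.wrap(string));
byte[] stringBytes = new byte[encoded.remaining()];
encoded.get(stringBytes);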

Second, construct the string using the bytes given by the encoder (non-exceptional path):

new String(stringBytes, StandardCharsets.UTF_8);
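
For reference, the two steps can be put together in one small self-contained sketch. The class name `StrictUtf8` and its method names are made up for illustration and are not part of any library. It also assumes you want decoding to be as strict as encoding, so it uses a `CharsetDecoder` with `REPORT` instead of the `String` constructor, which silently substitutes replacement characters for invalid bytes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class StrictUtf8 {

    // Encodes to UTF-8, throwing instead of substituting '?' for malformed input.
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        // Copy only the bytes that were written; out.array() may be longer.
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    // Decodes UTF-8, throwing instead of substituting U+FFFD for invalid bytes.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // A well-formed string containing a supplementary character (a surrogate pair).
        String original = "h\u00E9llo \uD83D\uDE00";
        System.out.println(decode(encode(original)).equals(original)); // prints "true"

        try {
            encode("\uD83D"); // a lone high surrogate is malformed UTF-16
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

Copying `out.remaining()` bytes rather than calling `array()` keeps trailing unused bytes of the backing array out of the result, so the decoded string does not pick up stray `\u0000` characters.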
  • Edited: I forgot to write the string instantiation snippet. – Manuel Carrasco May 18 '21 at 10:28
  • There are quite a few assumptions in your question that are almost but not quite true. For example "at runtime a Java char in a String will always represent a symbol in Unicode" is not correct, because a `String` is actually UTF-16 encoded (because a `char` can't hold all possible Unicode codepoints). If you want to be sure you can avoid the shortcut way of getting `byte[]` and use a `CharsetEncoder` directly where you can configure how it handles malformed input (i.e. get notified if it happens instead of silently transforming it into replacement characters). – Joachim Sauer May 18 '21 at 11:04
  • @JoachimSauer Well _technically_, OP didn't say "all Unicode code points are representable as `char`s". They only said "all `char`s represent some Unicode code point", which from what I know, is correct. The implication only goes one way :) – Sweeper May 18 '21 at 11:09
  • @Sweeper: while technically the surrogate values (0xD800-0xDFFF) are Unicode codepoints, these `char` values don't actually represent those Unicode codepoints, but only contain half of the information needed to find out what codepoint is actually referred to. – Joachim Sauer May 18 '21 at 11:18
  • @JoachimSauer Really? I didn't know that! I was reading JLS [§3.1](https://docs.oracle.com/javase/specs/jls/se14/html/jls-3.html#jls-3.1) and it said code points and code units are the same in the range 0-ffff, so I was (mis)led to believe that the char `'\ud800'` also represented the code point U+D800. – Sweeper May 18 '21 at 11:25
  • Yes, they are codepoints. But they are reserved for use as surrogate values and thus do not represent (and will never represent) any defined Unicode characters. Any `String` object which contains unpaired surrogates is not valid UTF-16. Java still "supports" (i.e. tolerates) those values, but it's not valid Unicode text. The exact nomenclature here is confusing and I might get some of it wrong, even though I've spent more time on this topic than I ever wanted. – Joachim Sauer May 18 '21 at 11:58
  • @ManuelCarrasco: it's fine to "answer" your own question, but please do so as an answer. This way it's more visible and can also be voted/commented on separately. FWIW that's the solution I'd also have gone for. – Generous Badger May 19 '21 at 08:58
  • @GenerousBadger Thanks for your input. I did it this way because my "answer" does not actually answer my original question, which has been answered by Sweeper. I originally asked whether there were any edge cases, not how to avoid them. Do you still think it would be OK to post my "answer" as a proper answer? – Manuel Carrasco May 19 '21 at 10:37

1 Answer


it will always yield a string equivalent to myString.

Well, not always. Not a lot of things in this world happen always.

One edge case I can think of is that myString could be an "invalid" string when you call getBytes. For example, it could contain a lone (unpaired) surrogate:

String myString = "\uD83D";

How often this will happen heavily depends on what you are doing with myString, so I'll let you think about that on your own.

If myString contains a lone surrogate, getBytes encodes a question mark character in its place:

// prints "?"
System.out.println(
    new String(myString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
);

I wouldn't say a ? is "equivalent" to a malformed string.
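
If you would rather detect this case up front than compare the result afterwards, one possible approach is to scan the string for unpaired surrogates before calling getBytes. The sketch below is only an illustration; `SurrogateCheck` and `hasUnpairedSurrogate` are hypothetical names, not JDK APIs. It relies only on `Character.isHighSurrogate` and `Character.isLowSurrogate`.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class SurrogateCheck {

    // Returns true if s contains a surrogate code unit that is not part of a well-formed pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no high surrogate before it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasUnpairedSurrogate("\uD83D"));       // prints "true"
        System.out.println(hasUnpairedSurrogate("\uD83D\uDE00")); // prints "false"
    }
}
```

A string for which this returns false should survive the UTF-8 round trip unchanged.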

See also: Is an instance of a Java string always valid UTF-16?

Sweeper
  • I'd phrase it like this: all **valid** Unicode strings represented in a well-encoded `String` will roundtrip correctly. Sometimes it's not quite obvious why a given `String` would not be valid. – Joachim Sauer May 18 '21 at 11:06
  • Thanks both for your answers so far. Would you agree that the code in my edit would prevent us from reaching unhandled cases? I appreciate that you provided an edge case which indeed makes my original code fail. – Manuel Carrasco May 18 '21 at 14:38
  • @ManuelCarrasco That would prevent malformed strings from being encoded, sure, but I suggest that you handle the exception somehow. – Sweeper May 19 '21 at 00:35
  • Thank you. Indeed, I just wanted to keep the example code short. – Manuel Carrasco May 19 '21 at 08:41
  • This is exactly why `String::length` is documented as returning the number of UTF-16 code units rather than the number of code points: a `String` holding a single supplementary character (an emoji, for example) has a length of `2`. That question mark is a way of telling you: this code unit is supposed to be half of a surrogate pair, and I have no idea what to do with it on its own. (There are also ways to throw an exception in such cases.) – Eugene May 19 '21 at 14:55
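
The difference between code units and code points discussed in these comments can be checked directly. This is a small sketch with a made-up class name (`LengthDemo`). It uses `String.length()`, which counts UTF-16 code units, and `String.codePointCount`, which counts code points and treats an unpaired surrogate as one code point.

```java
// Hypothetical demo class; the name is illustrative only.
final class LengthDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // one supplementary character, stored as a surrogate pair
        System.out.println(codeUnits(emoji));  // prints "2"
        System.out.println(codePoints(emoji)); // prints "1"

        String lone = "\uD83D"; // an unpaired high surrogate
        System.out.println(codeUnits(lone));   // prints "1"
        System.out.println(codePoints(lone));  // prints "1"
    }
}
```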