I currently need to work with the bytes of a String in Java, which has raised several questions about encodings and JVM implementation details. I would like to know whether what I'm doing makes sense or is redundant.
To begin with, I understand that at runtime a Java char in a String will always represent a symbol in Unicode.
Secondly, UTF-8 can encode every Unicode code point. I therefore expect the following snippet to always return a byte[] without doing any replacement (see the String.getBytes(Charset) documentation).
byte[] stringBytes = myString.getBytes(StandardCharsets.UTF_8);
Then, if stringBytes is used in a different JVM instance in the following way, it will always yield a string equivalent to myString.
new String(stringBytes, StandardCharsets.UTF_8);
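To make the expectation concrete, here is a minimal sketch of the round trip for a well-formed string. The class name Utf8RoundTrip and the helper roundTrip are hypothetical names used only for illustration, and the sample text (an accented letter plus an emoji stored as a surrogate pair) is just an assumption about typical input.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    // Encode to UTF-8 bytes and decode them again, as described above.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "héllo" followed by U+1F600, which Java stores as a surrogate pair.
        String original = "h\u00E9llo \uD83D\uDE00";
        System.out.println(roundTrip(original).equals(original));
    }
}
```

For well-formed input like this, the decoded string should compare equal to the original.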
Do you think that my understanding of getBytes is correct? If so, how would you justify it? Am I missing a pathological case in which I would not get back a string equivalent to myString?
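One candidate for such a pathological case, as far as I can tell, is a String that contains an unpaired surrogate: a Java char is a UTF-16 code unit, so a String can hold a sequence that is not valid Unicode text. The sketch below (the class name LoneSurrogateDemo and its helpers are hypothetical) checks whether such a string survives the round trip. Since getBytes(Charset) is documented to replace malformed input with the charset's default replacement bytes, the comparison is expected to fail.

```java
import java.nio.charset.StandardCharsets;

final class LoneSurrogateDemo {
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    static boolean survivesRoundTrip(String s) {
        return roundTrip(s).equals(s);
    }

    public static void main(String[] args) {
        // 0xD800 is a high surrogate with no low surrogate after it,
        // so this char sequence is not well-formed UTF-16.
        String malformed = new String(new char[] {'a', '\uD800', 'b'});
        System.out.println("survives round trip: " + survivesRoundTrip(malformed));
    }
}
```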
Thanks in advance.
EDIT:
Would you agree that, with the following approach, every non-exceptional flow leads to a handled case, which allows us to successfully reconstruct the string?
EDIT:
Based on the answers, here is a solution that lets you safely reconstruct the string whenever no exception is thrown. You still need to handle the exception somehow.
First, get the bytes using the encoder:
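// Needs java.nio.ByteBuffer, java.nio.CharBuffer and java.nio.charset.{CharsetEncoder, CodingErrorAction, StandardCharsets}.
final CharsetEncoder encoder =
    StandardCharsets.UTF_8
        .newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .onMalformedInput(CodingErrorAction.REPORT);
// encode() throws a CharacterCodingException for malformed input (e.g. an unpaired surrogate)
// instead of silently substituting a replacement.
final ByteBuffer encoded = encoder.encode(CharBuffer.wrap(string));
// Do not use encoded.array() directly: the backing array can be larger than the encoded content,
// and the extra trailing bytes would end up in the reconstructed string.
// Copy only the bytes between the buffer's position and limit.
final byte[] stringBytes = new byte[encoded.remaining()];
encoded.get(stringBytes);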
Second, construct the string using the bytes given by the encoder (non-exceptional path):
new String(stringBytes, StandardCharsets.UTF_8);
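For completeness, here is a self-contained sketch of both steps. The class name StrictUtf8 and the methods encodeStrict and decodeStrict are hypothetical names, not part of any library. The decoding side also uses a reporting decoder, which is my own assumption about what a strict counterpart would look like, since new String(byte[], Charset) replaces malformed bytes rather than throwing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Encodes to UTF-8, throwing instead of substituting a replacement.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer encoded = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        // Copy only the valid region; the backing array may be larger.
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }

    // Decodes UTF-8, throwing on malformed byte sequences.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String original = "h\u00E9llo \uD83D\uDE00";
        byte[] bytes = encodeStrict(original);
        System.out.println(decodeStrict(bytes).equals(original));
    }
}
```

With this shape, a string containing an unpaired surrogate should surface as a CharacterCodingException at encoding time, so any byte array that is actually produced decodes back to an equal string.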