I've already read the following posts:
- What is Java's internal representation for String? Modified UTF-8? UTF-16?
- https://docs.oracle.com/javase/8/docs/api/java/lang/String.html
Now consider the code given below:
import java.nio.charset.StandardCharsets;

public static void main(String[] args) {
    printCharacterDetails("最");
}

public static void printCharacterDetails(String character) {
    // codePointAt(0) gives the Unicode code point of the first character
    System.out.println("Unicode Value for " + character + "=" + Integer.toHexString(character.codePointAt(0)));
    // The no-argument getBytes() encodes with the platform's default charset
    byte[] bytes = character.getBytes();
    System.out.println("The UTF-8 Character=" + character + " | Default: Number of Bytes=" + bytes.length);
    // Decodes those default-charset bytes as if they were UTF-16
    String stringUTF16 = new String(bytes, StandardCharsets.UTF_16);
    System.out.println("The corresponding UTF-16 Character=" + stringUTF16 + " | UTF-16: Number of Bytes=" + stringUTF16.getBytes().length);
    System.out.println("----------------------------------------------------------------------------------------");
}
When I tried to debug the line character.getBytes() in the code above, the debugger took me into the getBytes() method of the String class and then into the static byte[] encode(char[] ca, int off, int len) method of the StringCoding class. The first line of the encode method, String csn = Charset.defaultCharset().name();, returned "UTF-8" as the default encoding during debugging. I expected it to be "UTF-16".
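This does seem consistent with what the String documentation says about the no-argument getBytes(): it encodes using the platform's default charset. Here is a minimal check I used to confirm (the printed charset name depends on the machine's locale):

import java.nio.charset.Charset;
import java.util.Arrays;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // Prints the platform default charset; "UTF-8" on my machine
        System.out.println(Charset.defaultCharset().name());
        String s = "最";
        // The no-arg getBytes() should produce the same bytes as
        // passing the default charset explicitly; this prints true
        System.out.println(Arrays.equals(
                s.getBytes(), s.getBytes(Charset.defaultCharset())));
    }
}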
The output of the program is:
Unicode Value for 最=6700
The UTF-8 Character=最 | Default: Number of Bytes=3
The corresponding UTF-16 Character=� | UTF-16: Number of Bytes=6
When I converted it to UTF-16 explicitly in the program, it took 6 bytes to represent the character. Shouldn't UTF-16 use 2 or 4 bytes per character? Why were 6 bytes used?
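For comparison, encoding the original character directly with explicit charsets gives the byte counts I expected (a small sketch; as far as I understand, StandardCharsets.UTF_16 prepends a two-byte byte-order mark, while UTF_16BE does not):

import java.nio.charset.StandardCharsets;

public class ByteCountComparison {
    public static void main(String[] args) {
        String s = "最"; // U+6700, a BMP character: one UTF-16 code unit
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);   // 4 (2-byte BOM + 2)
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2 (no BOM)
    }
}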
Where am I going wrong in my understanding? I use Ubuntu 14.04, and the locale command shows the following:

LANG=en_US.UTF-8
Does this mean that the JVM decides which encoding to use based on the underlying OS, or does it always use UTF-16 internally? Please help me understand the concept.
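For what it's worth, this is how I probed the environment-derived default (a sketch; I believe the JVM picks up the locale's charset via the file.encoding system property at startup, though I'm not sure how official that is):

import java.nio.charset.Charset;

public class PlatformEncodingProbe {
    public static void main(String[] args) {
        // Both values are derived from the environment at JVM startup,
        // independent of String's internal char[] (UTF-16 code units)
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
    }
}

On my machine both report UTF-8, which lines up with the LANG value above.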