Java's native character set for Strings

Question

I am utterly confused by the answers I have seen on Stack Overflow and by the Java docs.

While all the theory in the docs and in the Stack Overflow answers linked above seems to indicate that UTF-16 is the native character set supported by Java, there is another theory that says it depends on the JVM/OS. For example, this link says:

Every instance of the Java virtual machine has a default charset, which may or may not be one of the standard charsets. The default charset is determined during virtual-machine startup and typically depends upon the locale and charset being used by the underlying operating system.

Then, in another section of the same link, it says:

The native character encoding of the Java programming language is UTF-16.

I am finding it difficult to understand these apparently contradictory statements, as:

one says it is dependent on OS
the other (I infer) says, regardless of the OS, UTF-16 is the charset for Java (This is also what all of the links I have mentioned above say)

Now, when I execute the following piece of code:

package org.sheel.classes;

import java.nio.charset.Charset;

public class Test {

    public static void main(String[] args) {
         System.out.println(Charset.defaultCharset());
    }

}

...in an online editor I see UTF-8. On my local system I see windows-1252.

And lastly, there is a JDK Enhancement Proposal (JEP) which talks about changing the default to UTF-8

Could there be an explanation for this confusion?

I think the second section is referring to the encoding of `.java` files, not the charset for `String`s. — ifly6, Jun 04 '18 at 20:23
I hope it's clear from the answers that the user's default character encoding is almost never relevant in this century. — Tom Blodget, Jun 11 '18 at 01:50

roberto · Answer 1 · 2018-06-05T08:47:50.000

3

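A String internally is an array of char (see toCharArray()), each char being a UTF-16 code unit. Note that a code unit is not always a full code point: characters outside the Basic Multilingual Plane take two chars. When you convert the string to an array of bytes without specifying the charset, using getBytes(), the platform default charset (normally taken from the OS) is used.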

PS: as noted by VGR, recent implementations may not store String as array of char, but as programmers we usually interact using chars which are always UTF-16.
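A minimal sketch of the distinction, for illustration only: the class name CodeUnitsDemo and the helper codeUnits are made up for this example. It prints the UTF-16 code units that toCharArray() exposes, then shows that the number of bytes depends entirely on the charset passed to getBytes.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsDemo {

    // Formats each char (UTF-16 code unit) of s as four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "h\u00e9"; // "hé"

        // What the program sees: two chars.
        System.out.println(codeUnits(s)); // 0068 00e9

        // What ends up in a byte array depends on the charset you name.
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 4
    }
}
```

Calling s.getBytes() with no argument would pick whichever charset the JVM started with, so its result is not shown here.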

edited Jun 05 '18 at 08:47

answered Jun 04 '18 at 20:45


So does this mean that if the OS charset is UTF-8 and I have a String s="hi there" in my code, then this string is stored internally in UTF-16, and s.getBytes() without specifying the charset will get me the bytes as per the UTF-8 encoding? If yes, does this not also mean that if another string is constructed from this byte array, it's probably going to give you a different string value? – Sheel Pancholi Jun 04 '18 at 21:02
@SheelPancholi Yes. (Recent versions of Java may store Strings internally using something other than UTF-16, but that doesn’t matter, because it is impossible for a program to know about it. A `char` is always a UTF-16 value, regardless of what String does internally.) – VGR Jun 04 '18 at 21:08
Yes, it is. But if your app is going to run on more than one OS or in more than one language/country, always specify a charset to avoid conversion errors. As per the JEP, UTF-8 is a good choice. – roberto Jun 04 '18 at 21:15
@VGR Would you be able to elaborate on this? To say, "A char is always a UTF-16 value, regardless of what String does internally" sounds like a recursively contradictory sentence. A char is a UTF-16 value. I agree. But isn't that exactly what we mean when we say what a character in Java is internally stored like? So how can it be "regardless"? And to repeat my question, while converting such an "internally stored utf-16 string" into its byte array with the OS charset e.g. Utf-8 do we run the risk of getting a different string when constructing one from such a byte array? – Sheel Pancholi Jun 04 '18 at 21:20
@Sheel Keep in mind there is a difference between code units and code units serialized to bytes, even though almost always when we speak of applying a character encoding, we mean the resulting byte sequence from applying both the encoding and the serialization. – Tom Blodget Jun 11 '18 at 01:47

score 2 · Answer 2 · answered Jun 04 '18 at 22:10

The internal encoding used by String has nothing to do with the platform’s default charset. They are completely independent of each other.

String internals

Internally, a String may store its data as anything. As programmers, we don’t interact with the private implementation; we can only use public methods. The public methods usually return a String’s data as UTF-16 (char values), though some, like the codePoints() method, can return full UTF-32 int values. None of those methods indicate how String data is stored internally, only the forms in which a programmer may examine that data.

So, rather than saying that String stores data internally as UTF-16 or any other encoding, it’s correct to say that String stores a sequence of Unicode code points, and makes them available in various forms, most commonly as char values.
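To make the "various forms" concrete, here is a small sketch (the class and method names are invented for this example). It uses a string containing one character outside the Basic Multilingual Plane, so the char view and the code point view disagree about its length.

```java
public class CodePointsDemo {

    // Number of UTF-16 code units (char values) in s.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, written as a surrogate pair.
        String s = "a\uD83D\uDE00";

        System.out.println(charCount(s));      // 3
        System.out.println(codePointCount(s)); // 2

        // The same data, viewed as full code points.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Neither view says anything about how the String keeps its data in memory; both are just public ways of examining it.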

Default charset

The default charset is something Java obtains from the underlying system.

As roberto pointed out, the default charset matters when you use certain (outdated) methods and constructors. Converting a String to bytes, or converting bytes to a String, without explicitly specifying a charset, will make use of the default charset. Similarly, creating an InputStreamReader or OutputStreamWriter without specifying a charset will use the default charset.
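A sketch of the alternative, assuming UTF-8 is the encoding you want: pass the charset explicitly to OutputStreamWriter and InputStreamReader. The class name and the roundTrip helper are hypothetical, and in-memory streams stand in for a file or socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    // Writes text as UTF-8 bytes, then reads those bytes back as UTF-8.
    static String roundTrip(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        }

        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(out.toByteArray()),
                StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "\u03c0\u22603"; // "π≠3"
        // Same result on every platform, because no default charset is consulted.
        System.out.println(original.equals(roundTrip(original))); // true
    }
}
```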

It is usually unwise to rely on the default charset, as it will make your code behave differently on different platforms. Also, while some charsets (such as UTF-8) can represent every Unicode character, others can represent only a small subset of the total Unicode repertoire. In particular, Windows usually has a default charset which uses a single byte to represent each character (windows-1252 in US versions of Windows), and a single byte has only 256 possible values, which obviously isn’t enough for the more than a hundred thousand characters Unicode defines.

If you rely on the default charset, there is indeed a chance that you will lose information:

String s = "\u03c0\u22603"; // "π≠3"

byte[] bytes = s.getBytes();

for (byte b : bytes) {
    System.out.printf("%02x ", b);
}
System.out.println();

On most systems, where the default charset is UTF-8, this will print:

cf 80 e2 89 a0 33

On Windows, this will probably print:

3f 3f 33

The pi and not-equal characters aren’t represented in the windows-1252 charset, so on Windows, the getBytes method replaces them with question marks (byte value 3f).
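The same effect can be reproduced on any machine by naming the charsets instead of relying on the default. This is only a sketch: the class and helper names are made up, and ISO-8859-1 is used as a stand-in for windows-1252 because it is one of the charsets every Java platform must provide and it also lacks both characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyRoundTripDemo {

    // Encodes s with cs, then decodes the bytes with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u03c0\u22603"; // "π≠3"

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // cf 80 e2 89 a0 33
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // 3f 3f 33

        // UTF-8 survives the round trip; the single-byte charset does not.
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s)); // true
        System.out.println(roundTrip(s, StandardCharsets.ISO_8859_1));      // ??3
    }
}
```

This also answers the question raised in the comments: a String rebuilt from the bytes equals the original only if the charset used could represent every character in it.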

If conversion to or from bytes is not involved, String objects will never lose information, because regardless of how they store their data internally, the String class guarantees that every character will be preserved.
