How does the JVM determine the (default?) character encoding for argv on Linux

Question

Java has a default character encoding, which it uses in contexts where a character encoding is not explicitly supplied. The documentation for how it chooses that encoding is vague:

> The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

That documentation has to be vague because the method the JVM uses is system-specific.

Using the default character encoding is often a bad idea; it is better to specify an encoding explicitly, or to always use the same encoding for a given kind of I/O. But one unavoidable use of the default character encoding would seem to be for command-line arguments. On a POSIX system such as Linux, the native (C/C++) code of the JVM receives the command-line arguments as a null-terminated array of C/C++ char pointers. These ought to be thought of as byte pointers, since the bytes must encode code points in some (unspecified) manner. The JVM has to interpret those sequences of C/C++ chars (bytes) and convert them into sequences of Java chars, to be passed to the main() of the Java program. I assume the JVM uses the default character encoding for this.
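To illustrate why the choice of encoding matters here, the following minimal sketch decodes the same two bytes under two different charsets. It does not show what the JVM actually does with argv; the class name `ArgvDecodingDemo` and the helper `decode` are hypothetical, and the byte array merely stands in for one native argument.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ArgvDecodingDemo {
    // Decode raw bytes (a stand-in for one native argv entry) with the given charset.
    static String decode(byte[] rawArgument, Charset charset) {
        return new String(rawArgument, charset);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) as a UTF-8 terminal would pass it: bytes 0xC3 0xA9.
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        String asUtf8 = decode(raw, StandardCharsets.UTF_8);
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // Print lengths rather than the characters, so the console encoding cannot confuse the result.
        System.out.println("UTF-8:      " + asUtf8.length() + " char(s)");
        System.out.println("ISO-8859-1: " + asLatin1.length() + " char(s)");
    }
}
```

Decoded as UTF-8 the bytes yield the single char U+00E9; decoded as ISO-8859-1 they yield the two chars U+00C3 U+00A9. A program that receives the second form has no way of knowing the user typed a single character.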

So I need to know precisely how the JVM determines the default encoding for a particular system (a modern GNU/Linux operating system), so I can provide user documentation about how my program behaves, and so users of my program can predict how it will behave.

I guess the JVM examines some environment variables, but which ones?

PHP programs can have a [related problem](http://stackoverflow.com/questions/3410424/command-line-character-encoding-from-phps-exec). — Raedwald, Jan 13 '15 at 14:01

score 1 · Answer 1 · answered Jan 13 '15 at 14:22

You can of course look at the source code of java.nio.charset.Charset.defaultCharset(). When I do that on my system (64-bit Windows 7, with Oracle JDK 8 update 25) I see this:

```java
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            String csn = AccessController.doPrivileged(
                new GetPropertyAction("file.encoding"));
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}
```

In other words, it looks at the system property file.encoding and if it cannot find a matching Charset instance, it uses UTF-8.
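The same lookup-with-fallback behaviour can be approximated with public API only. This is a rough sketch, not the JDK's implementation: the JDK code calls the internal `lookup` method, whereas this version uses `Charset.forName` and catches the exception it throws for unknown or illegal names. The class name `FallbackLookup` and the method `resolveOrUtf8` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FallbackLookup {
    // Return the charset with the given name, or UTF-8 if the name is
    // null, syntactically illegal, or not supported by this JVM.
    static Charset resolveOrUtf8(String name) {
        if (name == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        // Mirrors the logic above: read file.encoding, fall back to UTF-8.
        String csn = System.getProperty("file.encoding");
        System.out.println("file.encoding = " + csn);
        System.out.println("resolved      = " + resolveOrUtf8(csn));
    }
}
```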

Jesper

Does this mean that the [doc of Charset.defaultCharset()](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/nio/charset/Charset.html#defaultCharset()) is a bit imprecise, and that if the `-Dfile.encoding` flag is not used the JVM will simply use UTF-8, without reading anything from the underlying OS? Cf. the doc: "determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system". – Mabsten Jan 06 '21 at 20:02
@Mabsten not necessarily. I suspect there is a default setting for `file.encoding` somewhere in the JVM, which will be used if you don't set it explicitly with `-D`. That default will depend on the operating system you're using. – Jesper Jan 07 '21 at 08:15
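One way to check this empirically is to print what the JVM reports under different environment settings. The sketch below is only a diagnostic aid with several assumptions: the class name `EncodingReport` is hypothetical; `sun.jnu.encoding` is an undocumented, implementation-specific property that may be absent; and `LC_ALL`, `LC_CTYPE` and `LANG` are simply the usual POSIX locale variables, not variables this thread confirms the JVM reads.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingReport {
    // Collect the encoding-related settings visible to this JVM. Values may be null.
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        report.put("file.encoding", System.getProperty("file.encoding"));
        // Assumption: implementation-specific property, not part of the documented API.
        report.put("sun.jnu.encoding", System.getProperty("sun.jnu.encoding"));
        // Assumption: the usual POSIX locale variables are the relevant ones.
        for (String name : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            report.put("env " + name, System.getenv(name));
        }
        return report;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) -> System.out.println(key + " = " + value));
        // Show each argument as code points, independent of the console encoding.
        for (String arg : args) {
            StringBuilder codePoints = new StringBuilder();
            arg.codePoints().forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
            System.out.println("arg: " + codePoints.toString().trim());
        }
    }
}
```

Running it with a non-ASCII argument under different settings, for example `LC_ALL=C java EncodingReport é` and then `LC_ALL=en_US.UTF-8 java EncodingReport é`, shows whether the reported default and the decoded argument change with the locale on a given JVM.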
