Why is my Java Charset.defaultCharset() GBK and not Unicode?

Question

Configuration: Windows 8 (English edition), JDK 1.7, Eclipse.

I installed a program written by a Chinese developer, and its GUI uses Chinese characters. The program displayed them as ugly square boxes. I searched the internet and found a fix: in the Windows 8 Control Panel, set "Language for non-Unicode programs" to "Chinese".

But a problem arises when writing code in Eclipse. We know Java itself uses two-byte Unicode to store char and String. But when I execute the following code:

import java.util.Arrays;
import java.nio.charset.Charset;

public class CharSetTest {
    public static void main(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset());
        String s = "哈哈";

        byte[] b3 = s.getBytes("UTF-8");
        System.out.println(b3.length);
        System.out.format("%X %X %X\n", b3[0],b3[1],b3[2]);
        System.out.println(new String(b3));

        byte[] b4 = s.getBytes();
        // Two bytes are printed, so the format string needs exactly two specifiers
        System.out.format("%X %X\n", b4[0], b4[1]);
    }
}

The output is weird:

GBK          //default charset is GBK, not Unicode or UTF-8  
3            //this is obvious since a Chinese character is encoded into 3 bytes  
E5 93 88     //this is corresponding UTF-8 code number  
鍝?          //something wrong here  
B9 FE        //I think s.getBytes() should use Java's default encoding "Unicode", but that is NOT the case here

Several questions:

1. What is Java's default charset? Is it Unicode? How does the Java default charset interact with programmers? For example, if Java uses Unicode, then a string "abc" cannot be encoded into other charsets, such as the ones for Russian, French, etc., since they use totally different encoding methods.
2. What does Charset.defaultCharset() return? Does it return my Windows 8's default charset?
3. Why does Charset.defaultCharset() return GBK? I didn't change anything related to the default charset in Windows 8 except "Language for non-Unicode programs" in the Control Panel.
4. If I declare a String in Java like this: String str = "abc";, I don't understand the charset/encoding process. I first need to type the Java statement on the keyboard. How does the keyboard translate my key presses into Java's Unicode charset? The String str is stored in my .java source code file. What charset is used to store Java source code?

EDIT:
Why do we say "Java uses Unicode to represent char and String"? In my Java program, when should I care about Unicode? Usually, I only need to care about encoding/decoding with UTF-8, ISO-8859-1, GBK, etc. But I never care about the Unicode representation of char and String. So how and when should I use Unicode?

score 2 · Accepted Answer · edited May 23 '17 at 10:28

Check the doc: "The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system." So no, the default character set is not necessarily Unicode.

In OpenJDK it is determined from the file.encoding property. See also Setting the default Java character encoding?.
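As a quick check of the point above, here is a minimal sketch that prints both the default charset and the file.encoding system property. The class name DefaultCharsetInfo and the helper defaultCharsetName() are made up for this illustration; the values printed depend entirely on your OS settings and JDK version, so no expected output is shown.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for illustration only.
final class DefaultCharsetInfo {

    /** Returns the canonical name of the charset the JVM chose at startup. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // What byte<->char conversions use when no charset is given
        System.out.println("Charset.defaultCharset(): " + defaultCharsetName());
        // The system property the default is derived from (may be null on some setups)
        System.out.println("file.encoding property:   " + System.getProperty("file.encoding"));
    }
}
```

On the asker's machine this would presumably print GBK for the first line. Starting the JVM with -Dfile.encoding=UTF-8 is the commonly used way to get a different default, though on older JDKs that was a widely relied-on convention rather than a documented guarantee.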

The default file.encoding value is fetched (on Windows) using* the GetUserDefaultLCID() function, which corresponds to the setting in the "Regional and language options". That's why Charset.defaultCharset() is returning GBK, because you set the locale to Chinese.

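Although the default character set is OS-dependent, Java strings are always UTF-16 at run time: a String is a sequence of 16-bit char code units, whatever the default charset is. (Inside a compiled *.class file, string constants are stored in a modified UTF-8 form and become UTF-16 again when the class is loaded.)

The encoding of a *.java source file is whatever you specify to the Java compiler, or the OS's default if none is given. See Java compiler platform file encoding problem.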
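To make the difference between the in-memory form and an encoded form concrete, here is a small sketch using the character from the question (U+54C8, written as a \u escape so the source file's encoding does not matter). The class Utf16Units and its hex() helper are invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
final class Utf16Units {

    /** Formats bytes as space-separated upper-case hex, e.g. "54 C8". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u54C8"; // the first character of the string in the question

        // The char itself is a UTF-16 code unit, independent of any default charset
        System.out.println(Integer.toHexString(s.charAt(0)));           // 54c8

        // Bytes only appear once an encoding is chosen
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 54 C8
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // E5 93 88
    }
}
```

The first two lines show the same number in two guises (a char value and its big-endian UTF-16 bytes); the third matches the E5 93 88 seen in the question's output.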

*: See http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/tip/src/windows/native/java/lang/java_props_md.c, line 577.

answered May 17 '13 at 07:06

kennytm

Many thanks. A Java source file is a normal text file, so strings declared in it are encoded according to the source file's encoding. For example, if notepad.exe is used, then on Windows 8 the source file is encoded as ANSI for storage on disk. So source.java is stored in ANSI format on disk. Of this I am sure. But is the compiled source.class file also a normal text file? – Zachary May 17 '13 at 07:31
@Zack: *.class file is a binary file, not a text file. – kennytm May 17 '13 at 07:32
Thanks. A *.class file is a bytecode file, NOT a binary file. I wonder whether only the Java VM can understand the bytecode file format? – Zachary May 17 '13 at 07:41
@Zack: By "binary" I just mean the opposite of "text". You could write any program to parse the `*.class` format. – kennytm May 17 '13 at 07:42
@peterlawrey In the Java programming language char values represent Unicode characters. So Java uses Unicode by default. Why did you say it is OS or implementation dependent? – Zachary May 17 '13 at 07:47

score 1 · Answer 2 · answered May 17 '13 at 07:05

1. The default character set is the character set that Java will use to convert bytes to chars or Strings (and vice versa) if you don't specify anything else (for example, if you create an InputStreamReader and don't pass an explicit charset).
2. Charset.defaultCharset() returns ... the default charset. What exactly that is depends on the implementation, but it is usually just what the OS would use in the same case.
3. That setting is exactly what your Java installation is using: "Chinese" means that some encoding that handles Chinese characters has to be provided, and GBK matches that just fine.
4. The encoding of Java source files can be specified when you compile them (using the -encoding parameter). If you don't specify it explicitly, then Java will use the platform default encoding (see #1).
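The points above can be sketched in code. The example below encodes a string with one charset and decodes it with another, which is what happens in the question when UTF-8 bytes are passed to new String(bytes) on a machine whose default is GBK. The class ExplicitCharsetDemo and the method recode() are names invented for this illustration, and the GBK part is guarded because that charset is not guaranteed to be present in every Java runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
final class ExplicitCharsetDemo {

    /** Encodes text with one charset and decodes the bytes with another. */
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u54C8\u54C8"; // the string from the question, as escapes

        // Same charset in both directions: the text survives
        System.out.println(
            recode(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(s)); // true

        // Mismatch: each of the 6 UTF-8 bytes becomes one Latin-1 character
        System.out.println(
            recode(s, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).length()); // 6

        // The situation in the question: UTF-8 bytes decoded as GBK
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println(recode(s, StandardCharsets.UTF_8, gbk).equals(s)); // false
        }
    }
}
```

Passing the charset explicitly, as in new String(b3, StandardCharsets.UTF_8), makes the result independent of the OS setting, which is usually what you want for data that leaves the JVM.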

answered May 17 '13 at 07:05

Joachim Sauer

2. I would say it is more OS setting dependent than JVM implementation dependent. – Peter Lawrey May 17 '13 at 07:06

@PeterLawrey: that's how it's implemented, but it's not required, at least not according to the JavaDoc of `defaultCharset()` ("... typically ..."). – Joachim Sauer May 17 '13 at 07:07
I believe the implementation would make an effort to be as close a match to the OS setting as it can. Where it is not it is due to a limitation of the JVM not knowing every possible setting (as this is not universally defined) and/or having something for it to map to. Some charsets are expensive and I can imagine embedded systems not supporting them all. – Peter Lawrey May 17 '13 at 07:10
In the Java programming language char values represent Unicode characters. So Java uses Unicode by default. Why did you say it is OS or implementation dependent? – Zachary May 17 '13 at 07:38
@Zack: read it more closely. I'm saying the default charset is implementation defined. That's used when converting from `char`/`String` (a.k.a *internal* representation) to `byte[]` (a.k.a. *external* representation). – Joachim Sauer May 17 '13 at 12:56

score 0 · Answer 3 · answered May 17 '13 at 07:04

What is JAVA default charset?

It's picked up from the default set in your OS. This could be something like Windows-1252.

Is it Unicode?

Unicode is not a charset in this sense. A charset defines how to encode characters as bytes.

How JAVA default charset interact with programmers?

It's the default used when you don't specify a charset.

For example, if JAVA use Unicode, then a string "abc" cannot be encoded into other charset since they are different from Unicode like charset for Russia, French etc, since they are totally different encoding method.

Internally Java uses UTF-16, but you rarely need to know that. This causes no issues for most languages; the exception is characters outside the Basic Multilingual Plane (some rare Chinese characters, for example), which take two chars each and require working with code points.

What does Charset.defaultCharset() return?
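A short sketch of the code point issue mentioned above, comparing a common character with one outside the Basic Multilingual Plane. U+20000 is used here only as a convenient example of a supplementary CJK ideograph, and the class CodePointDemo with its fromCodePoint() helper is made up for this illustration.

```java
// Hypothetical demo class; not part of any library.
final class CodePointDemo {

    /** Builds a String holding the single Unicode code point given. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String common = fromCodePoint(0x54C8);  // fits in one char
        String rare = fromCodePoint(0x20000);   // needs a surrogate pair

        // length() counts UTF-16 code units, codePointCount() counts characters
        System.out.println(common.length() + " " + common.codePointCount(0, common.length())); // 1 1
        System.out.println(rare.length() + " " + rare.codePointCount(0, rare.length()));       // 2 1

        // charAt(0) yields only half of the rare character
        System.out.println(Character.isHighSurrogate(rare.charAt(0)));  // true
        System.out.println(Integer.toHexString(rare.codePointAt(0)));   // 20000
    }
}
```

For text that may contain such characters, methods such as codePointAt() and codePointCount() are the safer choice over charAt() and length().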

It does what it appears to do. You can confirm this by reading the javadoc for this method.

Does it return my WIN8's default charset?

Yes, because that is what it is supposed to do. You only have a problem if your OS's character set cannot be mapped into Java or is not correctly mapped into Java. If it is the same, everything is fine.

How Charset.defaultCharset() return GBK. I didn't set anything in my WIN8 related default charset except the one for "language for non-Unicode Programs" in control panel.

It returns GBK because that is what Java sees as the Windows setting. To change it, you must set the character set you want in Windows.

If I declare a String in java like: String str = "abc";, I don't know the process of charset/encoding.

For the purposes of this question, there isn't any encoding involved. There are only characters, and they don't need to be encoded to make characters because they are already characters.

How the keyboard translates my key button into Java Unicode charset?

The keyboard doesn't. It only knows which keys you pressed. The OS turns these keys into characters.

The String str is stored in my .java source code file. What is the charset to store java source code?

That is determined by the editor which does the storing. Most likely it will be the OS default again, or if you change it you might make it UTF-8.

score 0 · Answer 4 · answered Mar 09 '15 at 08:57

I am not sure if this could help. To change the encoding in Eclipse: Project Explorer --- right-click on the Java file --- Run As --- Run Configurations --- Common (tab) --- Encoding. (On Linux it is set to UTF-8 by default.)

answered Mar 09 '15 at 08:57

stavke
