27

Some legacy code relies on the platform's default charset for translations. For Windows and Linux installations in the "western world" I know what that means. But thinking about Russian or Asian platforms I am totally unsure what their platform's default charset is (just UTF-16?).

Therefore I would like to know what I would get when executing the following code line:

System.out.println("Default Charset=" + Charset.defaultCharset());

PS:

I don't want to discuss the problems of charsets and how they differ from Unicode here. I just want to collect which operating systems result in which specific charset. Please post only concrete values!

Lii
Robert

2 Answers

32

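That's a user-specific setting. On many modern Linux systems, it's UTF-8. Macs have historically reported MacRoman. On Windows in the US and Western Europe, it's usually CP1252 (windows-1252); in Central and Eastern Europe it's often CP1250, and Cyrillic locales such as Russian typically use CP1251. In China, you often find a Simplified Chinese encoding (GBK or another GB* variant); Big5 is used for Traditional Chinese, for example in Taiwan and Hong Kong.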

But that's only the default, which each user can change at any time. That also points to the probable solution: set the encoding when you start your app using the system property `file.encoding`.

See this answer for how to do that. I suggest putting this into a small script that starts your app, so the user's default isn't tainted.
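As a minimal sketch of the alternative to relying on the default at all, the code below prints what the JVM picked up and then decodes bytes with a charset named by the caller. The class name `ExplicitCharsetDemo` and the helper `decode` are hypothetical, chosen only for illustration; the sample text is written with Unicode escapes so that the source file itself does not depend on any encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the legacy code discussed here.
final class ExplicitCharsetDemo {

    // Decodes bytes with a charset chosen by the caller, never the platform default.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // Whatever the JVM derived from the OS/user settings or from -Dfile.encoding.
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));

        // "Gruesse" with u-umlaut and sharp s, encoded explicitly as UTF-8.
        byte[] utf8Bytes = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8);

        // Decoding with the same explicit charset round-trips on every platform.
        String text = decode(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println("Decoded " + text.length() + " characters");
    }
}
```

Running it once normally and once with `-Dfile.encoding=...` on the command line is one way to see whether a given JVM honours the override; newer JDKs may restrict which values are accepted.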

Aaron Digulla
  • True, the system's default charset can be changed by a user, but how many non-developers do that? – Robert Feb 16 '12 at 14:38
  • 1
    How about people in a corporate network who take their global login with them? All I'm saying is: Never expect any useful value in there. In your code, you should always specify the encoding of data as you read it. If that doesn't work, then you must set `file.encoding` or things **will** break :-) – Aaron Digulla Feb 16 '12 at 14:40
  • 1
    @Aaron Digulla: In cases where the data is supplied by users and comes without encoding metadata, the platform default encoding might actually be your best bet. – Michael Borgwardt Feb 16 '12 at 14:48
  • @Aaron: I agree with you, but I will not change the code. I am only interested in knowing which charsets I am most likely to encounter. – Robert Feb 16 '12 at 14:49
  • 3
    What for, if I may ask? If the charset can change and corrupt your data, you need to handle this by making sure your app doesn't see the user's default. If the charset can change but this has no impact on your app, why bother? – Aaron Digulla Feb 16 '12 at 14:59
  • @Aaron: We are talking about translations loaded by the program. The program checks for UTF-16 and UTF-8 and, if neither fits, uses the default charset. That logic has been in use for years, so I have to assume that users rely on it to load translation files I don't know about. Changing it will break everything. Therefore I want to get an overview of what other charsets might be involved. – Robert Feb 16 '12 at 19:29
  • If you can, try to compile a list of the encodings your users use. If all else fails, you can open this URL: `http://your.doma.in/nosuchpath/translation/` + encoding and then grep the server's error log after a couple of days. You should also try to introduce a new policy that all translation files start with the encoding (for example in a comment). Change your code to print a warning/error if the encoding is missing, and change the editor to add it if it's missing. If the encoding is wrong, you don't always get an error in Java, which leads to spurious errors. – Aaron Digulla Feb 20 '12 at 13:25
  • Or you could just add `static { System.setProperty("file.encoding", "UTF-8"); }` to your main class to force UTF-8 before anything important happens. Which you should do. UTF-8 Everywhere. – Fordi Jul 21 '16 at 15:50
  • 2
    @Fordi `static` code in classes which are imported by your main class would still be able to see the old value. A much better solution is to invoke Java with `-Dfile.encoding=UTF-8`. But that also won't solve the problem that many file formats simply don't use UTF-8 as their default encoding, or that lazy users will try to feed files with unknown encodings to the software. – Aaron Digulla Jul 25 '16 at 16:08
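
The fallback described in the comments above (check for UTF-16 and UTF-8, otherwise use the default charset) could look roughly like the sketch below. The original program's code is not shown anywhere in this thread, so the class name `TranslationDecoder`, the byte-order-mark check for UTF-16, and the strict UTF-8 attempt are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "UTF-16, then UTF-8, then fallback" decoder.
final class TranslationDecoder {

    static String decode(byte[] data, Charset fallback) {
        // Assumption: UTF-16 is recognised by its byte order mark,
        // FE FF (big endian) or FF FE (little endian).
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if ((b0 == 0xFE && b1 == 0xFF) || (b0 == 0xFF && b1 == 0xFE)) {
                // The "UTF-16" charset uses the BOM to pick the byte order.
                return new String(data, StandardCharsets.UTF_16);
            }
        }
        try {
            // Strict decoding: malformed input throws instead of yielding U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: interpret the bytes in the fallback charset.
            return new String(data, fallback);
        }
    }

    public static void main(String[] args) {
        // A single byte 0xFC is not valid UTF-8, so the fallback is used.
        byte[] data = {(byte) 0xFC};
        String text = decode(data, Charset.defaultCharset());
        System.out.printf("Fallback %s gave U+%04X%n",
                Charset.defaultCharset(), (int) text.charAt(0));
    }
}
```

Passing the fallback as a parameter, rather than calling `Charset.defaultCharset()` inside `decode`, keeps the platform dependency in one visible place. Note that a file in a single-byte encoding whose bytes happen to form valid UTF-8 would be read as UTF-8 by this sketch.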
8

For Windows and Linux installations in the "western world" I know what that means.

Probably not as well as you think.

But thinking about Russian or Asian platforms I am totally unsure what their platform's default charset is

Usually it's whatever encoding is historically used in their country.

(just UTF-16?).

Most definitely not. Computer usage spread widely before the Unicode standard existed, and each language area developed one or more encodings that could support its language. Those who needed fewer than 128 characters outside ASCII typically developed an "extended ASCII", many of which were eventually standardized as ISO-8859, while others developed two-byte encodings, often several competing ones. For example, in Japan, emails typically use JIS, but webpages use Shift-JIS, and some applications use EUC-JP. Any of these might be encountered as the platform default encoding in Java.

It's all a huge mess, which is exactly why Unicode was developed. But the mess has not yet disappeared; we still have to deal with it, and we should not make any assumptions about the encoding of a given bunch of bytes that is to be interpreted as text. There Ain't No Such Thing as Plain Text.
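
To make the point concrete, here is a small sketch that decodes the same two bytes under several charsets and prints the resulting code points. The class name `SameBytesDifferentText` and the helper `decodeAs` are hypothetical. Only a handful of charsets are guaranteed to exist on every Java platform, so the sketch checks `Charset.isSupported` before using the others.

```java
import java.nio.charset.Charset;

// Hypothetical demo: identical bytes, different text depending on the charset.
final class SameBytesDifferentText {

    static String decodeAs(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // These two bytes are the UTF-8 encoding of U+00FC (u with diaeresis).
        byte[] data = {(byte) 0xC3, (byte) 0xBC};
        String[] names = {"UTF-8", "ISO-8859-1", "windows-1251", "Shift_JIS"};

        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this JVM");
                continue;
            }
            StringBuilder codePoints = new StringBuilder();
            // Print code points rather than characters so the console
            // encoding does not distort the result.
            decodeAs(data, name).codePoints()
                    .forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
            System.out.println(name + ": " + codePoints.toString().trim());
        }
    }
}
```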

GreenGiant
Michael Borgwardt
  • Michael, you are so super-right it brings me to tears. It’s such a disaster that I’ve even contemplated monkey-patching the standard libraries to forbid the ‘default encoding’. I have terabyte corpora that have been unfixably mutilated by this problem. It’s the unreasonable Java defaults that are the problem here, not Java itself, which can certainly cope with it. I don’t know how to fix it systemically, because being bug-compatible from the beginning of time through its end seems to be Java’s *modus operandi*. I don’t know how to fix design flaws. – tchrist Feb 16 '12 at 15:23
  • 1
    The thing is you can't "not make any assumptions". Users *will* write plain text files with no indication of encoding. Legacy systems *will* store strings with unknown encoding. – plugwash Feb 07 '18 at 15:56