
Problem:

On an English Windows 10 with a Slovenian keyboard layout, all command-line interfaces fail to display (print) non-ASCII characters such as č, š and ž; they are replaced with ?. (I assume this affects all non-ASCII characters, since ć and đ do not work either.)

Tested in:

  • CMD, Powershell, Cmder on Windows 10 64-bit English - Slovenian keyboard layout ... unsuccessful
  • IntelliJ IDEA on Windows 10 64-bit English language - Slovenian keyboard layout ... successful -> Works as needed in IDE, but not CLI.
  • CMD Windows 10 64-bit English language - English keyboard ... successful
  • CMD Windows 10 64-bit Slovenian language - Slovenian keyboard layout ... successful
  • Several distros of Linux (Ubuntu, Mint, Kali) ... successful

Tried so far:

  • changing chcp to chcp 65001 ... unsuccessful
  • creating Autorun file in regedit to force UTF-8 ... unsuccessful
  • different java compilers ... unsuccessful

Sample code:

public class Test2 {
    public static void main(String[] args) {
        System.out.println("č š ž ć đ");
    }
}

CMD:

>javac -encoding UTF-8 test2.java
>java Test2
? ? ? ? ? 

Other notes:

The problem appears on several computers with different hardware. All of the characters mentioned above display fine by default in every CLI listed above, so the problem only seems to occur with Java.
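One way to narrow this down is to ask the JVM which encoding it picked and to reproduce the ? replacement without any console involved. The following is a minimal sketch, not a confirmed diagnosis: the class name `EncodingCheck` and the helper `roundTrip` are made up for illustration, and `sun.stdout.encoding` is an internal, undocumented property that may simply print null. The letters are written as Unicode escapes so the result does not depend on the `-encoding` flag passed to javac.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic class; not part of the original question.
class EncodingCheck {
    // Unicode escapes for the five letters from the question,
    // so the source compiles identically under any -encoding.
    static final String SAMPLE = "\u010D \u0161 \u017E \u0107 \u0111";

    // Encode with the given charset and decode again. String.getBytes(Charset)
    // substitutes the charset's replacement byte for unmappable characters.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        System.out.println("default charset:     " + Charset.defaultCharset());
        System.out.println("file.encoding:       " + System.getProperty("file.encoding"));
        // Internal property; assumption: present only on some JDKs.
        System.out.println("sun.stdout.encoding: " + System.getProperty("sun.stdout.encoding"));
        // ISO-8859-1 contains none of the five letters, so each one is replaced.
        System.out.println("ISO-8859-1 round trip: "
                + roundTrip(SAMPLE, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 round trip intact: "
                + SAMPLE.equals(roundTrip(SAMPLE, StandardCharsets.UTF_8)));
    }
}
```

If the ISO-8859-1 line shows the same ? ? ? ? ? as the failing program, that suggests the characters are lost when Java encodes them for output, before the console ever sees them.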

user9420260
  • "On an english Windows 10 using slovenian keyboard layout" If you read the entire sentence, you'll get the clarification you're looking for. I do apologize for not stating that English - English and Slovenian - Slovenian meant the language of the OS and the keyboard layout. So yes, the problem seems to appear only on Windows 10 64-bit running in English with a Slovenian keyboard layout. For further clarification, I would like to add that the problem doesn't appear on Windows 10 64-bit English with a Slovenian keyboard layout when using the IntelliJ IDEA IDE. – user9420260 Feb 27 '18 at 19:20
  • Your program is attached to a console that it may have inherited from a shell, but the console has nothing directly to do with CMD or PowerShell. It is not a "CMD window". The console system uses instances of a host process (conhost.exe) for the window (Windows 7+) and a device driver (condrv.sys) for the ConDrv device (Windows 8+) that provides console files (Reference, Connect, Input, Output, CurrentIn, CurrentOut, Console). Typically a console client has a handle for Connect (general console API), Input (stdin), and Output (stdout, stderr). – Eryk Sun Feb 27 '18 at 19:25
  • The console screen buffer is UCS-2 Unicode and ideally should be written to using `WriteConsoleW`, a wide-character function. Legacy programs write multibyte strings using `WriteFile` or `WriteConsoleA`. The console uses its output codepage (`GetConsoleOutputCP` and `SetConsoleOutputCP`) to decode the string in this case. UTF-8 is marginally supported as codepage 65001, but it is extremely buggy depending on the version of Windows. For multibyte input (`ReadFile`, `ReadConsoleA`) it's much worse for all versions, including Windows 10, because it fails to read anything except 7-bit ASCII. – Eryk Sun Feb 27 '18 at 19:29

2 Answers


Use chcp 65001 then run with java -Dfile.encoding=UTF-8 Test2:

chcp 65001
javac -encoding UTF-8 Test2.java
java -Dfile.encoding=UTF-8 Test2

Remember to name your Java source file after the class name, case-sensitive.
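If passing `-Dfile.encoding=UTF-8` on every run is inconvenient, the program can also choose the output encoding itself. The sketch below is one possible approach under stated assumptions, not a tested fix for every Windows console: the class name `Utf8Out` and the helper `utf8` are hypothetical, and the console still has to be switched to code page 65001 with `chcp 65001` so that it decodes the bytes as UTF-8.

```java
import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical example class; not part of the original answer.
class Utf8Out {
    // Wrap any byte stream in an auto-flushing PrintStream that always
    // encodes text as UTF-8, regardless of file.encoding.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Replace System.out so println no longer uses the default encoding.
        System.setOut(utf8(new FileOutputStream(FileDescriptor.out)));
        // Escapes for the five letters from the question.
        System.out.println("\u010D \u0161 \u017E \u0107 \u0111");
    }
}
```

Run it after `chcp 65001` with plain `java Utf8Out`. The trade-off is that the encoding is now hard-coded in the program, so it would need to change if the output were sent to a console using a different code page.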

Andreas
  • It works, thank you, is there a way to force -Dfile.encoding=UTF-8 to execute automatically? – user9420260 Feb 27 '18 at 19:27
  • @user9420260 It is undocumented how the JVM assigns the default encoding. It is implementation specific, so I can't give you a simple answer. For potential Linux answer, see [How does the Java VM determine its default file.encoding?](https://superuser.com/q/519023) – Andreas Feb 27 '18 at 19:30
  • @user9420260 See also this answer: [Setting the default Java character encoding?](https://stackoverflow.com/a/623036/5221149) – Andreas Feb 27 '18 at 19:35
  • I don't know how Java responds to this, but writing UTF-8 (codepage 65001) to the console prior to Windows 8 (Windows 7 is still *very* common) is generally broken because `WriteFile` and `WriteConsoleA` report the wrong number of bytes written; they return the number of decoded UTF-16 elements written. In this case C/C++ and other language runtimes with buffered streams will automatically try to write what they think are the remaining bytes, and this results in a trail of garbage after every print that contains non-ASCII characters. – Eryk Sun Feb 27 '18 at 19:43
  • Also, Windows 7 still defaults to using an OEM raster font that causes even `WriteConsoleW` (the wide-character version) to outright fail with non-ASCII text if codepage 65001 is selected as the output codepage. – Eryk Sun Feb 27 '18 at 19:46

After following @Andreas's advice, I explored the issue further and found a fix that works:

First, force cmd to use chcp 65001 (UTF-8), following this link on Super User.

Second, use the following command:

set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8
user9420260
  • CMD uses the console's wide-character functions `ReadConsoleW` and `WriteConsoleW` to read and write Unicode (UTF-16). Running `chcp.com 65001` has nothing to do with setting anything at all in CMD. You are confusing a shell that uses the console with the actual console. As to codepage 65001, it's a bad solution. It's extremely broken in Windows 7, and even in Windows 10 you won't be able to read non-ASCII user input. That'll go over really well in non-English locales. If Java doesn't have a better answer to this, then its support for the Windows console is fundamentally broken. – Eryk Sun Feb 27 '18 at 22:28
  • @ErykSun Java works well with the system encoding. The problem is that Windows does not use Unicode as its default encoding, AFAIK (under W10, for example). Java programs are often deployed on non-Windows systems that use Unicode by default, so compiling with UTF-8 support (as mentioned in the OP) implies `JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8`. However, I agree with you on this: `chcp 65001` is not useful here – lauhub Oct 14 '19 at 14:48
  • @lauhub, the console API has supported Unicode in NT since its inception in 1993. Java could support wide-character functions, just as Python 3.6+ does. There are still problems however. The console was designed before Unicode was extended beyond the basic multilingual plane, so it handles UTF-16 surrogate pairs as individual ordinals. Also, for display, it doesn't support font fallback and complex scripts. To address these problems, Microsoft is developing an updated Terminal program that uses the console host (conhost.exe) as a backend server instead of using it as the UI client. – Eryk Sun Oct 14 '19 at 15:49