Why do additional characters show up when I print Unicode characters to the Windows 7 console from a Java program?

Question

Below is a Java program that prints a Unicode character to the Windows console:

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintUnicodeChar {
  public static void main (String[] argv) throws UnsupportedEncodingException {
    String unicodeMessage = "\u00A3"; // Pound sign

    PrintStream out = new PrintStream(System.out, true, "UTF-8");
    out.print(unicodeMessage);
  }
}

I have selected Lucida Console as the font and set the codepage to 65001.

The output I get is £�

If I print the pound sign three times using "\u00A3\u00A3\u00A3", the output becomes £££�£. Printing characters with higher Unicode values outputs more � characters, making the output more garbled.

Here is another string "\u00A3\n\u00A3\u00A3\n\u00A3\u00A3\u00A3\n\u00A3\u00A3\u00A3\u00A3\n\u00A3\u00A3\u00A3\u00A3\u00A3\n"

The output is

£

££

£££

££££

£££££

�£

£££££

�££

�

What is happening? Is it a problem with the Windows 7 terminal? How can I prevent the additional characters from being printed?

Prior to Windows 8, `WriteFile` to the console and `WriteConsoleA` incorrectly return the number of wide-characters written instead of the number of bytes. So UTF-8 is not supported for any buffered writer that depends on knowing how many bytes were successfully written. Also, the console doesn't support UTF-8 for input even in Windows 10. Taken together, this is sufficient reason to avoid using UTF-8 with the Windows console. Use the wide-character functions `WriteConsoleW` and `ReadConsoleW`. — Eryk Sun, Feb 27 '19 at 09:56
Here's an example in case it's not obvious how the above-mentioned bug leads to the observed behavior. `"£"` encodes to the two-byte UTF-8 string `"\xc2\xa3"`. When this byte string is written to the console in Windows 7 via WINAPI `WriteFile` or `WriteConsoleA`, the console returns that 1 wide character was written. A buffered writer thus thinks that only `"\xc2"` was written and tries a second write with `"\xa3"`. The system can't decode this byte value as UTF-8, so it uses the replacement character U+FFFD, which is commonly displayed as an empty box or a question mark in a box. — Eryk Sun, Feb 27 '19 at 10:16
In a nutshell, yes it is a problem with the Windows 7 terminal (and still there with all other windows terminals.) You're writing to it in UTF-8, while they're not set to UTF-8 unless you've made your terminal so by means external to java (such as the "chcp 65001" command). — kumesana, Feb 27 '19 at 10:49
@kumesana, the OP already set the console to UTF-8 (65001), which is exactly the problem since the console in Windows 7 *does not* properly support UTF-8. — Eryk Sun, Feb 27 '19 at 10:57
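The short-write behavior described in the comments above can be sketched in plain Java. This is only a simplified model, not the actual console code: `buggyConsoleWrite`, `writeFully`, and `screen` are hypothetical names invented for the illustration, and the real console may handle invalid byte sequences differently. The fake console decodes the bytes as UTF-8 but reports the number of UTF-16 characters displayed instead of the number of bytes consumed, so a retry loop resends bytes that were already shown.

```java
import java.nio.charset.StandardCharsets;

public class ShortWriteDemo {
    // Hypothetical stand-in for what ends up on the console screen.
    static final StringBuilder screen = new StringBuilder();

    // Simplified model of the pre-Windows 8 bug: the bytes are decoded as
    // UTF-8 and displayed, but the return value is the number of UTF-16
    // characters shown, not the number of bytes consumed.
    static int buggyConsoleWrite(byte[] buf, int off, int len) {
        String decoded = new String(buf, off, len, StandardCharsets.UTF_8);
        screen.append(decoded);
        return decoded.length();
    }

    // A typical "keep writing until every byte is reported as written" loop.
    static void writeFully(byte[] buf) {
        int off = 0;
        while (off < buf.length) {
            off += buggyConsoleWrite(buf, off, buf.length - off);
        }
    }

    public static void main(String[] args) {
        // "£" is the two bytes C2 A3 in UTF-8.
        writeFully("\u00A3".getBytes(StandardCharsets.UTF_8));
        // Print code points so the result is readable on any console.
        for (char c : screen.toString().toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

For a single pound sign, the first call displays £ and returns 1, so the loop resends the lone byte A3, which decodes to the replacement character U+FFFD. That matches the £� output in the question. Longer strings need not come out exactly as the real console prints them, because this model only approximates how the console decodes invalid input.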

score 0 · Accepted Answer · answered Feb 27 '19 at 12:36

There is nothing wrong with your code.

The problem occurs because Windows does not properly support UTF-8 in the console (a similar problem in Python: "How to display utf-8 in windows console").

You can bypass this problem by printing to a file instead:

PrintStream out = new PrintStream(new FileOutputStream(fileDir), true, "UTF-8");

Full code:

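import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

public static void main(String[] args) throws IOException {
    String unicodeMessage = "\u00A3"; // Pound sign
    File fileDir = new File("c:\\temp\\test.txt");
    // try-with-resources closes the stream so the bytes are flushed to the file
    try (PrintStream out = new PrintStream(new FileOutputStream(fileDir), true, "UTF-8")) {
        out.print(unicodeMessage);
    }
}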

Another solution: run your code in Docker on a Linux VM to avoid Windows-related issues.
