The following is all run in PowerShell 3.0, both in the standard console and in the PowerShell ISE, using a font that contains the tested Unicode code point.

The following C# program correctly prints ≈ (so we know it can work):

static void Main(string[] args)
{
    Console.WriteLine("\u2248"); // U+2248, ALMOST EQUAL TO (≈)
}

On a side note, when I look at Console.OutputEncoding it claims to be code page IBM850, which certainly can't be true. Even weirder, no matter what I set the console's code page to (using chcp), the output is fine, so .NET has to handle the encoding itself (or call some special APIs?).

Now when I try the following Java program, I end up with garbled output (`"H`):

public static void main(String[] args) {
    // println encodes with the platform default charset before the
    // bytes ever reach the console
    System.out.println("\u2248");
}

Now that is because Java looks up the system encoding and uses that, which here is windows-1252, so that's as expected (a quick way to verify the default is sketched after the next snippet). But the following also doesn't work:

public static void main(String[] args) throws UnsupportedEncodingException {
    // Wrap stdout in a PrintStream that encodes to UTF-16 instead
    new PrintStream(System.out, true, "UTF-16").println("\u2248");
}
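As a sanity check, this is how I verify what default encoding Java actually picked up (a quick diagnostic sketch; the class name is just for illustration):

import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // Both report the platform default that System.out encodes with
        System.out.println(System.getProperty("file.encoding")); // e.g. Cp1252
        System.out.println(Charset.defaultCharset());            // e.g. windows-1252
    }
}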

What I can do is use UTF-8 and call chcp 65001 beforehand (see the sketch below). This shows the right glyph, but has a bug where some characters are repeated at the end of the line: printing `\u2248weird.` results in `≈weird.d.`, so this is not great either.
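For reference, the full workaround as a runnable sketch (run chcp 65001 in the console before launching; the class name is just for illustration):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Out {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Needs `chcp 65001` in the console first; the glyph renders,
        // but the console bug repeats trailing characters on the line
        new PrintStream(System.out, true, "UTF-8").println("\u2248weird.");
    }
}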

So what encoding is C# using to write to the console, or more generally how can I get Java to correctly output Unicode in PowerShell?

Voo
  • Have you tried, after setting `chcp 65001`, running `java -Dfile.encoding=UTF-8` and using `System.out.println()` instead of creating a `PrintStream` over a `PrintStream`? – RealSkeptic Jul 11 '15 at 12:47
  • @RealSkeptic That's my actual plan for how to make existing jar files work correctly without having to change the code (so I hope it's just an encoding I need and not some weird Win32 API calls). It behaves exactly the same way as using the PrintStream, though. – Voo Jul 11 '15 at 12:49
  • The comments to this [answer](http://stackoverflow.com/a/388500/4125191) may help you understand why the bug is happening. Sorry that I don't have a solution, though. Might try 1200 or 1201 for utf-16. – RealSkeptic Jul 11 '15 at 13:11
  • @RealSkeptic Tried that too, but those two codepages are not supported by `chcp` (which, yes, I also find incredibly weird: invalid codepage, really?). And yeah, I assumed it was something like that with the bug. Heck, since it's pretty much reading from undefined memory I could even crash the console; annoying indeed, but so far this seems to be my best bet. – Voo Jul 11 '15 at 13:14

1 Answer

> what encoding is C# using to write to the console

None; .NET uses the Win32 API WriteConsoleW to write characters (well, UTF-16 code units) directly. There is no encode/decode-from-bytes step, so the console's code page is irrelevant. (And yes, 850 is the expected code page for Western Europe.)

Other apps and languages, including Java, use the C standard library IO functions, which deal in bytes, so there's an encode/decode stage involved, and this stage does use the console code page.
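You can see the encode stage in isolation from the Java side (a small sketch; the '?' substitution is the documented default behaviour of String.getBytes for unmappable characters, and the class name is just for illustration):

import java.nio.charset.Charset;

public class EncodeStage {
    public static void main(String[] args) {
        // U+2248 has no mapping in windows-1252, so the encoder
        // substitutes the replacement byte '?' (0x3F)
        byte[] bytes = "\u2248".getBytes(Charset.forName("windows-1252"));
        System.out.println(Integer.toHexString(bytes[0] & 0xFF)); // prints 3f
    }
}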

> What I can do is to use UTF-8 and call chcp 65001 beforehand. This works and then shows the right glyph, but has a bug where some characters are repeated

This is part of a set of long-standing bugs in the Windows command line's support for code page 65001. For this reason, code page 65001 is generally not a viable way to get C-stdlib applications to support Unicode on the console.

Generally there is no pure cross-platform way to write command-line apps that support Unicode. You have to detect that you're connected to a character-oriented console (rather than a byte-oriented pipe) and running on Windows, and in that case branch to call Win32 WriteConsoleW. An example using JNA is sketched below.
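A minimal sketch of the JNA route, assuming JNA 5.x (net.java.dev.jna:jna) on the classpath; the binding covers only the two calls needed, and the class names are just for illustration:

import com.sun.jna.Native;
import com.sun.jna.Pointer;
import com.sun.jna.ptr.IntByReference;
import com.sun.jna.win32.StdCallLibrary;

public class ConsoleW {
    // Direct binding of the two kernel32 functions we need
    public interface Kernel32 extends StdCallLibrary {
        Kernel32 INSTANCE = Native.load("kernel32", Kernel32.class);

        Pointer GetStdHandle(int nStdHandle);

        boolean WriteConsoleW(Pointer hConsoleOutput, char[] lpBuffer,
                              int nNumberOfCharsToWrite,
                              IntByReference lpNumberOfCharsWritten,
                              Pointer lpReserved);
    }

    private static final int STD_OUTPUT_HANDLE = -11;

    public static void main(String[] args) {
        Pointer stdout = Kernel32.INSTANCE.GetStdHandle(STD_OUTPUT_HANDLE);
        // Java strings are already UTF-16, so the char[] goes straight to
        // the console; no byte encoding step, no code page involved
        char[] text = "\u2248\r\n".toCharArray();
        Kernel32.INSTANCE.WriteConsoleW(stdout, text, text.length,
                new IntByReference(), Pointer.NULL);
    }
}

A real implementation would first call GetConsoleMode on the handle and fall back to System.out when that call fails, which is what happens when stdout is redirected to a pipe or file.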

bobince