
I'd like to know how to make my code produce the same output (UTF-8 or UTF-16) on different platforms (at least Windows and Linux).
I thought it was possible to set a codepage for the application to use, but I can't find any information on how to set one. I also don't know whether setting a codepage would really produce the same output for special characters like äöü or other non-Latin characters.

I'd like to have a solution that works without setting arguments for java.exe.

Edit:
I mean the output to a console. A comment about possible effects on other output media would be nice.

wullxz
  • Do you mean output to the console? to a file? to a gui? – assylias Jan 05 '13 at 13:05
  • If the output is on a console, it depends on the console capabilities: you don't really have a control on that. – fge Jan 05 '13 at 13:06
  • I'm sorry, I mean to the console. But a comment on how a solution would affect output to a file/GUI/whatever would also be nice. – wullxz Jan 05 '13 at 13:06
  • [this post](http://stackoverflow.com/questions/13348811/get-list-of-processes-on-windows-in-a-charset-safe-way) is about reading from the console on Windows but some concepts apply. – assylias Jan 05 '13 at 22:04

2 Answers


A charset (or codepage, as it used to be called) defines how a sequence of characters is converted into a sequence of bytes and back.

In the Java API, charsets are implemented as subclasses of Charset. All API elements that convert between characters and bytes can be given the charset to use (many also accept the charset name instead, so you don't have to do the lookup yourself). If you do not provide a charset, those methods usually fall back to the platform's default encoding, which can differ from one machine to the next.

For instance, OutputStreamWriter features a constructor that takes a charset:

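import java.io.*;
import java.nio.charset.StandardCharsets;
// Note: closing the writer at the end of the try block also closes System.out.
try (Writer w = new OutputStreamWriter(System.out, StandardCharsets.UTF_8)) {
    w.write("Hello world");
}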
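To see why naming the charset makes the output platform-independent, it can help to look at the bytes directly. The following is only a minimal sketch: the class and method names (`ExplicitCharsetDemo`, `encodeUtf8`, `encodeUtf16`) are made up for illustration, and UTF-16BE is chosen arbitrarily as the UTF-16 variant.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetDemo {
    // Encode with a named charset so the result does not depend on the
    // platform default encoding.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-16, big-endian, without a byte order mark (an arbitrary choice here).
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // \u00F6 is 'ö'; the escape keeps the source file itself ASCII-only.
        String text = "Hell\u00F6";
        System.out.println(Arrays.toString(encodeUtf8(text)));
        System.out.println(Arrays.toString(encodeUtf16(text)));
    }
}
```

The byte arrays are the same on every platform; whether a terminal then displays them correctly is a separate question.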
meriton
  • I added `w.flush()` after the `write` statement to make the stream writer output its buffer. This works on Linux but not on Windows. My test string was `"Hellö Wörld \u262E"`. I also set Eclipse to use UTF-8 as the default encoding. – wullxz Jan 05 '13 at 13:43
  • @wullxz It doesn't work on Windows if the target device doesn't accept UTF-8 data. For example, the cmd.exe Command Prompt uses locale-specific OEM codepages from the 1980s by default and old raster fonts - analysis [here](http://illegalargumentexception.blogspot.co.uk/2009/04/i18n-unicode-at-windows-command-prompt.html). Most Linux terminals use UTF-8. – McDowell Jan 05 '13 at 14:08
  • Okay, so it's the Windows shell (CMD/PowerShell) that messes with my output? Is it possible to let my app check which codepage the current terminal supports and have the OutputStreamWriter use the appropriate codepage/character set? – wullxz Jan 05 '13 at 14:20
  • I suppose it's not possible to force the shell to use a specific encoding (UTF-8)? I need some Unicode symbols to be printed in the shell. – wullxz Jan 05 '13 at 14:31
  • @wullxz I've expanded in an answer of my own - unfortunately, you will probably have to compromise somewhere – McDowell Jan 05 '13 at 14:39

A Java char is a UTF-16 code unit, and UTF-16 is capable of representing every code point in the Unicode character set (code points outside the Basic Multilingual Plane take two chars). Pretty much all I/O where strings are used involves some implicit transcoding operation.
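As a small illustration of the UTF-16 point, the sketch below compares the number of chars in a string with the number of Unicode code points. The class and method names are hypothetical, and U+1D11E (a musical symbol outside the Basic Multilingual Plane) is just a convenient example.

```java
public class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String peace = "\u262E";                              // inside the BMP
        String clef = new String(Character.toChars(0x1D11E)); // outside the BMP
        System.out.println(codeUnits(peace) + " unit(s), " + codePoints(peace) + " code point(s)");
        System.out.println(codeUnits(clef) + " unit(s), " + codePoints(clef) + " code point(s)");
    }
}
```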

To save and restore character data without loss or corruption it is generally best to use one of the Unicode transformation formats. There are reader and writer types that can be used to perform this transcoding operation. Avoid the constructors that do not take an encoding, as they rely on the default encoding, which can be a legacy encoding best consigned to decades past. Explicitly specifying UTF-8 is generally preferred.
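A rough sketch of saving and restoring text with an explicit encoding might look like the following. The class name `Utf8RoundTrip`, the helper `roundTrip`, and the temporary file are assumptions made for the example; the point is only that both the writer and the reader are told to use UTF-8.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {
    // Writes the text to a temporary file as UTF-8 and reads it back as UTF-8.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                try (Writer w = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                    w.write(text);
                }
                StringBuilder sb = new StringBuilder();
                try (Reader r = Files.newBufferedReader(tmp, StandardCharsets.UTF_8)) {
                    char[] buf = new char[1024];
                    int n;
                    while ((n = r.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "Hell\u00F6 W\u00F6rld \u262E";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

Because neither side relies on the default encoding, the file contents are the same on Windows and Linux.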

There are different issues with writing to the terminal. Here you are writing data that will be decoded by another application so you must write character data in a format it understands.

The Console type will detect and use the terminal's encoding whereas System.out uses the default platform encoding - these are different on Windows for a bunch of historical reasons. The other differences are noted here. The documented way to use Unicode in cmd.exe is to use the appropriate Win32 API calls.
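One way to act on the Console versus System.out difference is sketched below: prefer the Console's writer when a console is attached and fall back to an explicitly encoded writer otherwise. The helper name `terminalWriter` and the choice of UTF-8 for the fallback are assumptions for illustration, not something the Java API prescribes.

```java
import java.io.Console;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class TerminalWriter {
    // Hypothetical helper: use the Console's writer if the JVM has a console,
    // otherwise wrap System.out with an explicit charset (UTF-8 is an
    // assumption here; redirected output often ends up in a file or pipe).
    static PrintWriter terminalWriter() {
        Console console = System.console();
        if (console != null) {
            return console.writer();
        }
        return new PrintWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8), true);
    }

    public static void main(String[] args) {
        PrintWriter out = terminalWriter();
        out.println("Hell\u00F6 W\u00F6rld \u262E");
        out.flush();
    }
}
```

Characters that the terminal's encoding or font cannot represent may still show up as replacement characters, so this does not guarantee identical rendering everywhere.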

Some relevant posts from my blog:

BalusC also has a good post on some of the practical issues of character handling: Unicode - How to get the characters right?

McDowell