I have been trying to read Unicode user input in my Java application for a small utility snippet. It works out of the box on Ubuntu, which I guess uses UTF-8 as its OS-wide encoding, but it doesn't work on Windows when run from cmd. The code in question is as follows:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.util.Arrays;

public class SerTest {

    public static void main(String[] args) throws Exception {
        testUnicode();
    }

    public static void testUnicode() throws Exception {
        System.out.println("Default charset: " +
           Charset.defaultCharset().name());
        // Decode stdin as UTF-8 regardless of the platform default
        BufferedReader in  =
           new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        System.out.printf("Enter 'абвгд эюя': ");
        String line = in.readLine();
        String s = "абвгд эюя";
        // getBytes() without an argument encodes with the default charset
        byte[] sBytes = s.getBytes();
        System.out.println("strg bytes: " + Arrays.toString(sBytes));
        byte[] lineBytes = line.getBytes();
        System.out.println("line bytes: " + Arrays.toString(lineBytes));
        // Encode stdout as UTF-8 regardless of the platform default
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.print("--->" + s + "<----\n");
        out.print("--->" + line + "<----\n");
    }

}

Output on Ubuntu (without any changes to configuration):

me@host> javac SerTest.java  && java SerTest
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

Output at the Windows cmd prompt (setting JAVA_TOOL_OPTIONS makes no difference):

E:\>chcp 65001
Active code page: 65001

E:\>java -Dfile.encoding=utf8 SerTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
Default charset: UTF-8
Enter 'абвгд эюя': юя': ': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Exception in thread "main" java.lang.NullPointerException
        at SerTest.testUnicode(SerTest.java:26) # byte[] lineBytes = line.getBytes();
        at SerTest.main(SerTest.java:15)

Output in Eclipse console (after using JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----

In the Eclipse console it works because I have added a system-wide environment variable (JAVA_TOOL_OPTIONS), which I would like to avoid if possible.

Output in Eclipse console (after removing JAVA_TOOL_OPTIONS):

Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-61, -112, -62, -80, -61, -112, -62, -79, -61, -112, -62, -78, -61, -112, -62, -77, -61, -112, -62, -76, 32, -61, -111, -17, -65, -67, -61, -111, -59, -67, -61, -111, -17, -65, -67]
--->абвгд эюя<----
--->абвгд �ю�<----

So my question is: what exactly is going on here? What code changes would be required to ensure that this snippet works for all sorts of "Unicode" input?

Sorry for the long-winded question and thanks in advance,
Sasuke

sasuke

2 Answers

Some notes:

  • -Dfile.encoding=utf8 is not supported and may cause unintended side-effects:

The "file.encoding" property is not required by the J2SE platform specification; it's an internal detail of Sun's implementations and should not be examined or modified by user code. It's also intended to be read-only; it's technically impossible to support the setting of this property to arbitrary values on the command line or at any other time during program execution.

  • The Console class will detect and use the terminal encoding but doesn't support 65001 (UTF-8) on Windows - at least, it didn't the last time I tried it
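As a rough illustration of the Console point above, here is a minimal sketch that prefers java.io.Console when one is attached and otherwise falls back to System.in with an explicit charset. The class and method names (ConsoleInput, promptLine, orEmpty) are made up for this example, and the UTF-8 fallback is an assumption about what piped input looks like. It also guards against readLine() returning null, which is what it returns at end-of-stream and which would explain the NullPointerException in the question. Per the note above, this may still not help under code page 65001.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ConsoleInput {

    /** Maps the null that readLine() returns at end-of-stream to an empty string. */
    static String orEmpty(String line) {
        return line == null ? "" : line;
    }

    /** Reads one line, preferring java.io.Console when the JVM reports one. */
    static String promptLine(String prompt) throws IOException {
        Console console = System.console();
        if (console != null) {
            // Console picks its own charset for the attached terminal
            return orEmpty(console.readLine("%s", prompt));
        }
        // Assumption: redirected or piped input is UTF-8 encoded
        System.out.print(prompt);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        return orEmpty(in.readLine());
    }

    public static void main(String[] args) throws IOException {
        String line = promptLine("Enter text: ");
        System.out.println("UTF-16 code units read: " + line.length());
    }
}
```

Returning an empty string at end-of-stream is only one option; throwing an exception with a clear message may suit a real tool better.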

I believe that the correct, documented way to use Unicode with cmd.exe is to use WriteConsoleW and ReadConsoleW.

I wrote a couple of blog posts when I was looking at this:

McDowell
  • Ah, so basically no sane way of reading/writing unicode stuff when writing windows command line apps? And here I was debugging UTFEncoder/Decoder from sun.* packages... – sasuke Jan 02 '12 at 06:33
  • As far as I am aware, there is no cross-platform way. There are a number of 3rd party console libraries out there that may give you a common interface to write to for all platforms but I don't know what level of I18N support they have. – McDowell Jan 02 '12 at 13:03
  • Thanks. I guess I'll have to look into the few curses implementations floating around (like this one: http://slashie.net/libjcsi/) and hope they handle unicode in a sane way. Accepted! – sasuke Jan 03 '12 at 06:53
The NPE is thrown when you call Arrays.toString(lineBytes), which means that lineBytes is null.

lineBytes holds the value of line.getBytes(). getBytes() can return null only if an UnsupportedEncodingException is thrown inside it.

It happens on Windows because the Windows command prompt does not support Unicode by default. It works on Ubuntu because its terminal is fully Unicode-enabled. It works partially with Eclipse because Eclipse's console window is a Java component that supports Unicode for input, and supports it for output once JAVA_TOOL_OPTIONS is set.
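The garbled "line bytes" in the question's last Eclipse run look like UTF-8 bytes that were decoded with a single-byte charset somewhere and then encoded as UTF-8 again. Here is a minimal sketch that reproduces that pattern, assuming windows-1252 was the intermediate charset (the thread does not confirm where the mix-up happened). MojibakeDemo and doubleEncode are names invented for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MojibakeDemo {

    /** Encodes text as UTF-8, misreads those bytes as windows-1252, then encodes as UTF-8 again. */
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: the wrong charset applied along the way was windows-1252
        String misread = new String(utf8, Charset.forName("windows-1252"));
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "абвгд эюя" written with escapes so the result does not depend on the source file encoding
        String s = "\u0430\u0431\u0432\u0433\u0434 \u044d\u044e\u044f";
        System.out.println("clean bytes:   " + Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        System.out.println("mangled bytes: " + Arrays.toString(doubleEncode(s)));
    }
}
```

For 'а' (UTF-8 bytes D0 B0) this gives -61, -112, -62, -80, the same four bytes that open the question's garbled output. Bytes with no windows-1252 mapping, such as 0x8D, would presumably come back as U+FFFD, which would account for the replacement characters in the printed line.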

The bottom line is that you need to configure the Windows command prompt so that it can handle Unicode characters. I have seen several discussions on this topic. Please take a look at this one: Unicode characters in Windows command line - how?

I hope this will help you.

AlexR
  • That's the way to go. I don't think anyone could add anything to this answer. – Milad Naseri Dec 29 '11 at 14:53
  • Thanks for the reply. A couple of clarifications: The NPE is because of calling `getBytes()` on `line` which means `line` is NULL which doesn't make a lot of sense. I can confirm that there is no `UnsupportedEncodingException` thrown (at least I don't see it). Lastly, I tried out the suggestion mentioned in the linked thread, same result. Any idea what might be going bad here? – sasuke Dec 29 '11 at 14:59
  • @sasuke, I think you are wrong. See your stack trace: `at SerTest.testUnicode(SerTest.java:26)` and `at SerTest.main(SerTest.java:15)`. That means that there are 11 lines between main() and the point where the NPE is thrown. And this is exactly `byte[] lineBytes = line.getBytes();`. – AlexR Dec 29 '11 at 15:18
  • Hi Alex, I can tell it's `line.getBytes()` because I added a new line `System.out.println(line)` and it gave me `null`. Also, if you are on Windows, I would appreciate if you could run the same code and let me know if it works for you. Thanks. – sasuke Dec 29 '11 at 15:32