I have been trying to retrieve "unicode user input" in my Java application for a small utility snippet. The problem is, it seems to be working on Ubuntu "out of the box" which has I guess OS wide encoding at UTF-8 but doesn't work on Windows when run from "cmd". The code in consideration is as follows:
public class SerTest {
public static void main(String[] args) throws Exception {
testUnicode();
}
public static void testUnicode() throws Exception {
System.out.println("Default charset: " +
Charset.defaultCharset().name());
BufferedReader in =
new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
System.out.printf("Enter 'абвгд эюя': ");
String line = in.readLine();
String s = "абвгд эюя";
byte[] sBytes = s.getBytes();
System.out.println("strg bytes: " + Arrays.toString(sBytes));
byte[] lineBytes = line.getBytes();
System.out.println("line bytes: " + Arrays.toString(lineBytes));
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.print("--->" + s + "<----\n");
out.print("--->" + line + "<----\n");
}
}
Output on Ubuntu (without any changes to configuration):
me@host> javac SerTest.java && java SerTest
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----
Output on windows CMD prompt (in no way affected by JAVA_TOOL_OPTIONS):
E:\>chcp 65001
Active code page: 65001
E:\>java -Dfile.encoding=utf8 SerTest
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
Default charset: UTF-8
Enter 'абвгд эюя': юя': ': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Exception in thread "main" java.lang.NullPointerException
at SerTest.testUnicode(SerTest.java:26) # byte[] lineBytes = line.getBytes();
at SerTest.main(SerTest.java:15)
Output in Eclipse console (after using JAVA_TOOL_OPTIONS):
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=utf8
line bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
--->абвгд эюя<----
--->абвгд эюя<----
On Eclipse console, it is working because I have added a system wide environment variable (JAVA_TOOL_OPTIONS) which if possible I would like to avoid.
Output in Eclipse console (after removing JAVA_TOOL_OPTIONS):
Default charset: UTF-8
Enter 'абвгд эюя': абвгд эюя
strg bytes: [-48, -80, -48, -79, -48, -78, -48, -77, -48, -76, 32, -47, -115, -47, -114, -47, -113]
line bytes: [-61, -112, -62, -80, -61, -112, -62, -79, -61, -112, -62, -78, -61, -112, -62, -77, -61, -112, -62, -76, 32, -61, -111, -17, -65, -67, -61, -111, -59, -67, -61, -111, -17, -65, -67]
--->абвгд эюя<----
--->абвгд �ю�<----
So my question is: what exactly is going on here? What code changes would be required to ensure that this snippet works for all sorts of "Unicode" input?
Sorry for the long winded question and thanks in advance,
Sasuke