public static void main(String[] args) throws IOException {
   String str1 = "ΔΞ123456";
   System.out.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")); //ΔΞ123456-true

   BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
   String str2 = br.readLine(); //ΔΞ123456 same as str1.
   System.out.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")); //Δ�123456-false

   System.out.println(str1.equals(str2)); //false
}

The same string doesn't match the regex when it is read from the keyboard.
What causes this problem, and how can it be solved?
Thanks in advance.

EDIT: I used System.console() for input and output.

public static void main(String[] args) throws IOException {
        PrintWriter pr = System.console().writer();

        String str1 = "ΔΞ123456";
        pr.println(str1+"-"+str1.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str1.length());

        String str2 = System.console().readLine();
        pr.println(str2+"-"+str2.matches("^\\p{InGreek}{2}\\d{6}")+"-"+str2.length());

        pr.println("str1.equals(str2)="+str1.equals(str2));
}

Output:

ΔΞ123456-true-8
ΔΞ123456
ΔΞ123456-true-8
str1.equals(str2)=true

athspk
  • How do you know that str2 is the same as str1? What method did you use to verify that they are in fact the same? – Mark Byers Jan 02 '11 at 15:30
  • You are right, Mark. It's not the same, as I had thought. – athspk Jan 02 '11 at 15:58
  • Print out all java properties: System.getProperties().list(System.out); Look for "file.encoding". – Kennet Jan 02 '11 at 15:58
  • Be aware that there are code points within the Greek block that are not in the Greek script, and similarly there are *many* code points which **are** in the Greek script but which **are not** in Greek block. The [unichars program](http://training.perl.com/scripts/unichars) run as `unichars -u '\p{InGreek}' '\P{IsGreek}' | wc -l` shows there are 28 of the first group, while `unichars -a '\p{IsGreek}' '\P{InGreek}' | wc -l` shows there are 395 in the second group. See also the [uniprops program](http://training.perl.com/scripts/uniprops) for exploring things the other way around. – tchrist Jan 02 '11 at 18:23
  • I should probably add that **Java doesn’t support Unicode script types until JDK7!** You can kinda use `[\p{InGreek}\p{InGreekExtended}\p{InAncientGreekNumbers}\p{InAncientGreekMusicalNotation}]`, but there are still 66 code points in those four blocks that are **not** of type `Script=Greek`. – tchrist Jan 02 '11 at 18:27
  • Which IDE are you using ? +1 for JDK version ! – Stefanos Kalantzis Jan 02 '11 at 18:28
  • Did you both compile with `javac -encoding UTF-8` and also run with `java -Dfile.encoding=UTF-8`? – tchrist Jan 02 '11 at 18:31
  • @Stefanos: @Whom are you addressing with the IDE question? – tchrist Jan 02 '11 at 18:32
  • @tchrist: I am not familiar with those terms (script, block), but it was good that you pointed out this detail. I tried `javac -encoding UTF-8` and `java -Dfile.encoding=UTF-8`; it didn't work. – athspk Jan 02 '11 at 19:56
  • @Stefanos: I am using the latest Eclipse (3.6.1), if you are asking me, and jdk1.6.0_23. – athspk Jan 02 '11 at 19:58
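
The block-versus-script distinction raised in the comments above can be checked directly on Java 7 or later. The following is only a sketch: the class name ScriptVsBlock and its helper methods are made up for illustration, and the three sample code points were picked by hand. It compares the block property \p{InGreek} with the script property \p{IsGreek} for a Greek letter, a Coptic letter that lives inside the Greek block, and a Greek letter from the Greek Extended block.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; requires Java 7+ for script properties.
final class ScriptVsBlock {
    // Block property: code points located in the "Greek and Coptic" block.
    static final Pattern GREEK_BLOCK = Pattern.compile("\\p{InGreek}");
    // Script property: code points whose Unicode script is Greek.
    static final Pattern GREEK_SCRIPT = Pattern.compile("\\p{IsGreek}");

    static boolean inBlock(String s) {
        return GREEK_BLOCK.matcher(s).matches();
    }

    static boolean inScript(String s) {
        return GREEK_SCRIPT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String delta = "\u0394";      // GREEK CAPITAL LETTER DELTA
        String copticShei = "\u03e2"; // COPTIC CAPITAL LETTER SHEI, inside the Greek block
        String alphaPsili = "\u1f00"; // GREEK SMALL LETTER ALPHA WITH PSILI, Greek Extended block
        for (String s : new String[] {delta, copticShei, alphaPsili}) {
            System.out.printf("U+%04X block=%b script=%b%n",
                    s.codePointAt(0), inBlock(s), inScript(s));
        }
    }
}
```

For the pattern in the question this makes no difference, since both Δ and Ξ are in the Greek block and in the Greek script; it only matters for input that strays outside the basic letters.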

4 Answers


There are multiple places where transcoding errors can take place here.

  1. Ensure that your class is being compiled correctly (unlikely to be an issue in an IDE):
    • Ensure that the compiler is using the same encoding as your editor (i.e. if you save as UTF-8, set your compiler to use that encoding)
    • Or switch to escaping to the ASCII subset that most encodings are a superset of (i.e. change the string literal to "\u0394\u039e123456")
  2. Ensure you are reading input using the correct encoding:
    • Use the Console to read input - this class will detect the console encoding
    • Or configure your Reader to use the correct encoding (probably windows-1253) or set the console to Java's default encoding

Note that System.console() returns null in an IDE, but there are things you can do about that.
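
The two points above can be combined in one small program. This is a minimal sketch, not a definitive fix: the class name GreekInput, the readLine helper, and the convention of passing a charset name as the first command-line argument are all assumptions made for illustration. It writes the literal with Unicode escapes so the source file encoding no longer matters, prefers the Console when one is attached, and otherwise falls back to a Reader with an explicit charset.

```java
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical demo class; UncheckedIOException needs Java 8+.
final class GreekInput {
    // Escaped form of the literal from the question, independent of source encoding.
    static final String EXPECTED = "\u0394\u039e123456";

    // Illustrative helper: read one line, decoding the bytes with the given charset.
    static String readLine(InputStream in, Charset charset) {
        try {
            return new BufferedReader(new InputStreamReader(in, charset)).readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Console console = System.console();
        String line;
        if (console != null) {
            // The Console picks up the console encoding by itself.
            line = console.readLine();
        } else {
            // Assumed convention: charset name in args[0], else the JVM default.
            Charset charset = args.length > 0
                    ? Charset.forName(args[0])
                    : Charset.defaultCharset();
            line = readLine(System.in, charset);
        }
        // Print only a boolean so the result does not depend on output encoding.
        System.out.println(EXPECTED.equals(line));
    }
}
```

Printing a boolean rather than the string keeps the check meaningful even when the output side is still misconfigured.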

McDowell

If you use Windows, the cause may be that the console character encoding (the "OEM code page") is not the same as the system encoding (the "ANSI code page").

An InputStreamReader created without an explicit encoding parameter assumes the input data is in the system default encoding, so characters read from the console are decoded incorrectly.
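
The effect of decoding with the wrong charset can be reproduced without a console at all. The sketch below is illustrative only: the class name MojibakeDemo is made up, and UTF-8 and ISO-8859-1 stand in for whatever pair of encodings is actually mismatched on a given machine. It encodes the string from the question with one charset and decodes the same bytes with another.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; StandardCharsets needs Java 7+.
final class MojibakeDemo {
    // Same pattern as in the question.
    static final String PATTERN = "^\\p{InGreek}{2}\\d{6}";

    // Encode as UTF-8, then decode the same bytes as ISO-8859-1 (deliberately wrong).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u0394\u039e123456";
        String garbled = misdecode(original);
        // Lengths and match results are printed instead of the strings themselves,
        // so the output does not depend on the console encoding.
        System.out.println(original.length() + " " + original.matches(PATTERN));
        System.out.println(garbled.length() + " " + garbled.matches(PATTERN));
        System.out.println(original.equals(garbled));
    }
}
```

Each Greek letter occupies two bytes in UTF-8, so the misdecoded string has 10 characters instead of 8 and neither matches the pattern nor equals the original. Comparing length() of the two strings is a quick way to spot this kind of mismatch.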

To read non-US-ASCII characters correctly in the Windows console, specify the console encoding explicitly when constructing the InputStreamReader (the required code page number can be found by running mode con cp at the command line):

BufferedReader br = new BufferedReader(
    new InputStreamReader(System.in, "CP737")); 

The same problem applies to output: you need to construct a PrintWriter with the proper encoding:

PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "CP737"));
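
A slightly fuller sketch of the output side follows. The class name EncodedOutput, both helper methods, and the fallback to the default charset are assumptions added for illustration; Cp737 is used only because it is the code page discussed here, and it may not be available on every Java runtime. The writer is created with auto-flush enabled so that println output appears immediately.

```java
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

// Hypothetical demo class.
final class EncodedOutput {
    // Illustrative helper: auto-flushing writer that encodes text with the given charset.
    static PrintWriter writerFor(OutputStream out, Charset charset) {
        return new PrintWriter(new OutputStreamWriter(out, charset), true);
    }

    // Illustrative helper: fall back to the JVM default if the charset is not available.
    static Charset charsetOrDefault(String name) {
        return Charset.isSupported(name) ? Charset.forName(name) : Charset.defaultCharset();
    }

    public static void main(String[] args) {
        // Assumed convention: console code page name in args[0], else Cp737.
        String name = args.length > 0 ? args[0] : "Cp737";
        PrintWriter out = writerFor(System.out, charsetOrDefault(name));
        out.println("\u0394\u039e123456");
    }
}
```

The charset passed here should match what mode con cp reports for the console that will display the text.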

Note that since Java 1.6 you can avoid these workarounds by using the Console object obtained from System.console(). It provides a Reader and a Writer with correctly configured encoding, as well as some utility methods.

However, System.console() returns null when the streams are redirected (for example, when running from an IDE). A workaround for this problem can be found in McDowell's answer.


axtavt
  • I tried `BufferedReader br = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));` but it didn't work. I also tried CP1253. – athspk Jan 02 '11 at 16:16
  • @athspk: You need OEM codepage of Greek Windows. It should be `CP737` or `CP869`, I'm not sure which one. – axtavt Jan 02 '11 at 16:20
  • @athspk: What's the result of `str2.length()`? – axtavt Jan 02 '11 at 16:33
  • @athspk: The output clearly says you have UTF-8 input interpreted as Windows-1252. Are you sure `new InputStreamReader(System.in, "UTF-8"))` doesn't work? – axtavt Jan 02 '11 at 16:45
  • It doesn't work, either from the IDE or from the command line. When I run from the command line, str2 is ??123456, and the length is now 8 (same as str1). – athspk Jan 02 '11 at 17:11
  • @athspk: You can try reading input via `System.console()`, as suggested by McDowell. Also, what does `mode con cp` command show? – axtavt Jan 02 '11 at 17:22
  • mode con cp = 737. I tried new InputStreamReader(System.in, "737") before, but it didn't work WHEN RUNNING FROM THE IDE. Now I ran it from the command line and the regexp matched! Output: ─╬123456-true. How do I get the string to print correctly? – athspk Jan 02 '11 at 17:51
  • @athspk: For output you need to create `PrintWriter` with proper encoding, the same way you do for input. Since Java 1.6 you also can avoid all these problems by using `System.console()` for input and output. – axtavt Jan 02 '11 at 18:03

I get true in both cases without changing anything in your code. (I tested with a Greek keyboard layout - I'm from Greece :])
Your input is probably arriving encoded as ISO 8859-7 and not UTF-8. Mine arrives as UTF-8.

EDIT: I still get true after adding the equals call:

System.out.println(str1.equals(str2));


Check whether you can get it working by changing everything to Greek in the regional options (if you are using Windows).

Rundll32 Shell32.dll,Control_RunDLL Intl.cpl,,0

If that is the case, then you can act accordingly, as axtavt said.

Stefanos Kalantzis

The keyboard is likely not sending the characters as UTF-8, but as the operating system's default character encoding.

James Tikalsky