I have a Chinese-language file in GB2312 encoding. I downloaded it as-is from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO with wget under Windows and saved it as ModernChineseCharacterFrequencyList.html.

The code below shows that Java fails to read the file to the end when the Scanner is created one way, but succeeds when it is created another way.

Namely, if the Scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work. If it is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), it DOES work.

The commented-out delimiter pattern lines show another configuration under which the glitch persists.

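import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;
//import java.util.regex.Pattern; // needed only for the delimiter variant

// The class name is arbitrary; it only makes the snippet compilable.
public class ScannerEncodingTest {

    public static void main(String[] args) throws FileNotFoundException {

        File src = new File("ModernChineseCharacterFrequencyList.html");
        //Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");

        Scanner scanner;
        String line;

        //scanner = new Scanner(src, "GB2312"); // does NOT work: stops partway through
        scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work

        //scanner.useDelimiter(frequencyDelimitingPattern);

        // Print every token until the Scanner reports no more input.
        while (scanner.hasNext()) {
            line = scanner.next();
            System.out.println(line);
        }
        scanner.close();
    }
}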

Is this a glitch or by-design behavior?

UPDATE

When the code DOES work, it reads all tokens to the end. When it does NOT work, it stops reading approximately in the middle of the file, with no exception or error message.

I found nothing unusual at the point where reading stops, nor does the position correspond to any "magic" number such as 2^32.
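One way to narrow down a silent stop like this: Scanner catches any IOException thrown by its underlying source, makes hasNext() return false, and exposes the exception through ioException(). The sketch below drains a Scanner built each way and reports the token count together with whatever exception was swallowed. The class and method names (ScannerDiagnostics, drain, viaFile, viaStream) are hypothetical, and whether an exception is actually recorded for this particular file is an assumption that has to be checked by running it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

// Hypothetical helper class, not part of the original question.
public class ScannerDiagnostics {

    /** Token count plus the IOException the Scanner swallowed, if any. */
    public static final class Result {
        public final int tokens;
        public final IOException error; // null if the Scanner reported none

        Result(int tokens, IOException error) {
            this.tokens = tokens;
            this.error = error;
        }
    }

    /** Reads every token, then asks the Scanner whether it hid an IOException. */
    static Result drain(Scanner scanner) {
        try {
            int tokens = 0;
            while (scanner.hasNext()) {
                scanner.next();
                tokens++;
            }
            // ioException() returns the last IOException thrown by the source.
            return new Result(tokens, scanner.ioException());
        } finally {
            scanner.close();
        }
    }

    /** Variant reported as NOT working in the question. */
    public static Result viaFile(File src, String charsetName) throws IOException {
        return drain(new Scanner(src, charsetName));
    }

    /** Variant reported as working in the question. */
    public static Result viaStream(File src, String charsetName) throws IOException {
        return drain(new Scanner(new FileInputStream(src), charsetName));
    }

    public static void main(String[] args) throws IOException {
        File src = new File("ModernChineseCharacterFrequencyList.html");
        Result a = viaFile(src, "GB2312");
        Result b = viaStream(src, "GB2312");
        System.out.println("File constructor:   " + a.tokens + " tokens, error = " + a.error);
        System.out.println("Stream constructor: " + b.tokens + " tokens, error = " + b.error);
    }
}
```

If the File-based variant reports a non-null error, its class and message should say more about why reading stops than the bare token count does.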

UPDATE 2

Originally I found this behavior on Windows with Sun's Java SE 1.6.

I have now found the same behavior on Ubuntu with OpenJDK 1.6.0_23.

Dims

1 Answer

I cannot test my answer right now, but the JDK 6 documentation lists different canonical names for encodings depending on the API you use: java.io or java.nio.

JDK 6 Supported Encodings

Maybe, instead of using "GB2312", you should use "EUC_CN", which is the suggested canonical name for Java I/O.
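Whether two encoding names resolve to the same charset on a given JVM can be checked directly with java.nio.charset.Charset rather than inferred from the documentation table. The following is a minimal sketch; the class and method names (CharsetNames, describe, sameCharset) are hypothetical, and the output depends on which charsets the running JVM ships with.

```java
import java.nio.charset.Charset;

// Hypothetical helper class for inspecting charset names on the running JVM.
public class CharsetNames {

    /** Describes how the JVM resolves a charset name: canonical name and aliases. */
    static String describe(String name) {
        if (!Charset.isSupported(name)) {
            return name + " is not supported by this JVM";
        }
        Charset cs = Charset.forName(name);
        return name + " -> canonical " + cs.name() + ", aliases " + cs.aliases();
    }

    /** True if both names are supported and resolve to the same Charset. */
    static boolean sameCharset(String a, String b) {
        return Charset.isSupported(a)
                && Charset.isSupported(b)
                && Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        System.out.println(describe("GB2312"));
        System.out.println(describe("EUC_CN"));
        System.out.println("Same charset: " + sameCharset("GB2312", "EUC_CN"));
    }
}
```

If both names resolve to the same Charset, swapping one for the other in the Scanner constructor should not change the decoding.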

Edwin Dalorzo
  • No. GB2312 is not the same as EUC_CN. The first comes from the PRC, the second from the ROC. Further, using the wrong name should throw an exception, not silently do 'something else'. – bmargulies Jan 05 '12 at 13:53
  • Then there is indeed an error in the JDK Documentation since they are represented as equivalent canonical names. – Edwin Dalorzo Jan 05 '12 at 13:57
  • EUC_CN is one of two forms of GB2312: http://en.wikipedia.org/wiki/Extended_Unix_Code. It is a mystery why Java aliases them. Meanwhile, I'm quite sure that the 'io' APIs just call the 'nio' APIs. – bmargulies Jan 05 '12 at 13:58
  • Well then, it may be an initial clue to solving the mystery. If the JDK encodings are incorrect, it is only logical that things are not working. – Edwin Dalorzo Jan 05 '12 at 14:01