I have a file in GB2312 encoding (Chinese). It was downloaded as is from http://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=MO with wget under Windows and saved as ModernChineseCharacterFrequencyList.html.
The code below shows that Java fails to read the file to the end when the Scanner is constructed one way, but succeeds when it is constructed another way.
Namely, if the Scanner is created with scanner = new Scanner(src, "GB2312"), the code does not work.
If the Scanner is created with scanner = new Scanner(new FileInputStream(src), "GB2312"), then it DOES work.
The commented-out delimiter pattern lines show another option I tried; the glitch remains with it as well.
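import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Pattern;

public class ReadFrequencyList {
    public static void main(String[] args) throws FileNotFoundException {
        File src = new File("ModernChineseCharacterFrequencyList.html");
        // Optional delimiter; the problem occurs with or without it.
        //Pattern frequencyDelimitingPattern = Pattern.compile("<br>|<pre>|</pre>");
        Scanner scanner;
        String line;
        //scanner = new Scanner(src, "GB2312"); // does NOT work
        scanner = new Scanner(new FileInputStream(src), "GB2312"); // does work
        //scanner.useDelimiter(frequencyDelimitingPattern);
        // Print every token until the Scanner reports no more input.
        while (scanner.hasNext()) {
            line = scanner.next();
            System.out.println(line);
        }
    }
}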
Is this a glitch or by-design behavior?
UPDATE
When the code DOES work, it reads all tokens to the end of the file. When it does NOT work, it stops reading approximately in the middle of the file, with no exception or error message.
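One way to check whether an exception is being swallowed: Scanner catches any IOException thrown by its underlying source and exposes the last one through ioException(). The following is a minimal sketch, not a confirmed diagnosis. The class name ScannerDiagnostics and the helper countTokens are hypothetical names introduced here for illustration; the file name and charset are the ones from the code above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

class ScannerDiagnostics {

    /**
     * Reads every token, then prints any IOException the Scanner caught
     * internally. Returns the number of tokens read.
     * (Hypothetical helper, not part of the original code.)
     */
    static int countTokens(Scanner scanner) {
        int count = 0;
        try {
            while (scanner.hasNext()) {
                scanner.next();
                count++;
            }
            // Scanner does not rethrow I/O errors; it only remembers the last one.
            IOException swallowed = scanner.ioException();
            if (swallowed != null) {
                System.err.println("Scanner stopped early: " + swallowed);
            }
        } finally {
            scanner.close();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File src = new File(args.length > 0
                ? args[0] : "ModernChineseCharacterFrequencyList.html");
        String charset = args.length > 1 ? args[1] : "GB2312";

        // Same two constructors as in the question, compared side by side.
        int viaFile = countTokens(new Scanner(src, charset));
        int viaStream = countTokens(new Scanner(new FileInputStream(src), charset));

        System.out.println("File constructor:        " + viaFile + " tokens");
        System.out.println("InputStream constructor: " + viaStream + " tokens");
    }
}
```

If the two counts differ and a message is printed for the File variant, the exception type should indicate why reading stopped.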
I found nothing unusual in the file at the place where reading breaks off. Nor did any "magic" numbers like 2^32 show up.
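Bytes that a strict decoder rejects can be hard to spot by eye, so a programmatic check might help here. The sketch below decodes the raw bytes with a CharsetDecoder configured to report errors and returns the offset of the first sequence it cannot decode. StrictDecodeCheck and firstBadOffset are hypothetical names, and Files.readAllBytes needs Java 7 or later, whereas the behavior above was observed on Java 6. Whether undecodable input is actually what stops the Scanner is an assumption to verify, not something established here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

class StrictDecodeCheck {

    /**
     * Returns the byte offset of the first malformed or unmappable sequence
     * for the given charset, or -1 if everything decodes cleanly.
     * (Hypothetical helper, not part of the original code.)
     */
    static int firstBadOffset(byte[] bytes, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input position is at the start of the bad bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0
                ? args[0] : "ModernChineseCharacterFrequencyList.html";
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        int offset = firstBadOffset(bytes, "GB2312");
        System.out.println(offset < 0
                ? "No undecodable bytes found"
                : "First undecodable byte sequence at offset " + offset);
    }
}
```

A reported offset can then be compared with the position where the Scanner stops producing tokens.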
UPDATE 2
Originally the behavior was found on Windows with Sun's Java SE 1.6.
The same behavior now also occurs on Ubuntu with OpenJDK 1.6.0_23.