@VGR got it right.
tl;dr: Use Scanner in = new Scanner(new File(fileName), "ISO-8859-1");
What appears to be happening is that:
- Your file is not valid UTF-8, due to that lone 0x9C byte.
- The Scanner is reading the file as UTF-8, since that is the system default.
- The underlying libraries throw a MalformedInputException.
- The Scanner catches and hides it (a well-meaning but misguided design decision).
- It then starts reporting that it has no more lines.
- You won't know anything has gone wrong unless you explicitly ask the Scanner for its last exception.
Here's an MCVE:
import java.io.*;
import java.util.*;
class Test {
    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(new File(args[0]), args[1]);
        while (in.hasNextLine()) {
            String line = in.nextLine();
            System.out.println("Line: " + line);
        }
        System.out.println("Exception if any: " + in.ioException());
    }
}
Here's an example of a normal invocation:
$ printf 'Hello\nWorld\n' > myfile && java Test myfile UTF-8
Line: Hello
Line: World
Exception if any: null
Here's what you're seeing (except that you don't retrieve and show the hidden exception). Notice in particular that no lines are shown:
$ printf 'Hello\nWorld \234\n' > myfile && java Test myfile UTF-8
Exception if any: java.nio.charset.MalformedInputException: Input length = 1
And here it is when decoded as ISO-8859-1, an encoding in which every byte sequence is valid (even though 0x9C has no assigned character and therefore doesn't show up in a terminal):
$ printf 'Hello\nWorld \234\n' > myfile && java Test myfile ISO-8859-1
Line: Hello
Line: World
Exception if any: null
If you're only interested in ASCII data and don't have any UTF-8 strings, you can simply ask the Scanner to use ISO-8859-1 by passing it as the second argument to the Scanner constructor:
Scanner in = new Scanner(new File(fileName), "ISO-8859-1");
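If you'd rather have the decoding error surface loudly instead of being silently swallowed, one alternative (a sketch, not part of the original answer) is to read through Files.newBufferedReader, whose decoder reports malformed input by throwing MalformedInputException rather than hiding it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class StrictRead {
    // Reads all lines as UTF-8. Unlike Scanner, any malformed byte
    // sequence propagates as a MalformedInputException instead of
    // quietly ending iteration.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        for (String line : readLines(Path.of(args[0]))) {
            System.out.println("Line: " + line);
        }
    }
}
```

With this approach, the file above containing the stray 0x9C byte fails fast with an exception at the exact point of the bad input, rather than producing an empty-looking result.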