1

I have a string in a file that I read with the nextLine() method of the Scanner class. The string looks like this:

some_string = "All the staff in the operating room has been specifically trained with a theoretical and practical 20-hour course.\xe2\x80\xa9Results: The overall average incidence of adverse events reported was determined by 4.8%, is consistent with the expectations of the study protocol, and is at a lower level than the average median rate of international studies (8.9%).\n"

I create a scanner object in the following way:

 Scanner br = new Scanner(new File("location of my file"), "UTF-8");

Then I read the lines by doing:

while (br.hasNextLine()) {
       System.out.println(br.nextLine());
}

and I get two lines instead of one:

>All the staff in the operating room has been specifically trained with a theoretical and practical 20-hour course.
>Results: The overall average incidence of adverse events reported was determined by 4.8%, is consistent with the expectations of the study protocol, and is at a lower level than the average median rate of international studies (8.9%).

It seems that nextLine() fails when there are non-ASCII characters. Any ideas why this happens?

kolonel
  • Are you sure that the file is encoded as UTF-8? – Dawood ibn Kareem Apr 04 '14 at 02:51
  • @DavidWallace yes. Upon further thinking I notice that the sequence of '\xe2\x80\xa9' is some form of paragraph splitter from here http://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string-literal – kolonel Apr 04 '14 at 03:15
  • @DavidWallace Any ideas on how I can avoid anything that is not a new line character? – kolonel Apr 04 '14 at 03:15
  • You could try `br.next("[^\\n]*\\n")` instead of `br.nextLine()`. I haven't tested it so I have no idea whether it works, but given what the Javadoc says, it seems likely. If it works, let me know and I'll convert this comment to an answer. – Dawood ibn Kareem Apr 04 '14 at 03:20
  • That hex sequence is U+2029. See also http://stackoverflow.com/questions/5918896 – McDowell Apr 04 '14 at 05:16

2 Answers

2

Try setting the delimiter to "\n" explicitly and reading with next(), so that only a newline ends a token:

    Scanner scanner = new Scanner(new File("the file"), "UTF-8").useDelimiter("\n");

    while (scanner.hasNext())
        System.out.println(scanner.next());
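
To see the difference without a file, here is a minimal sketch that runs both approaches on an in-memory string. The class name, method names and the shortened sample text are made up for illustration; only the U+2029 in the middle and the trailing "\n" mirror the string from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ParagraphSeparatorDemo {
    // Hypothetical sample: U+2029 (PARAGRAPH SEPARATOR) in the middle, "\n" at the end.
    static final String SAMPLE = "20-hour course.\u2029Results: 4.8%\n";

    // nextLine() also breaks on U+2028, U+2029 and U+0085, not only on \n and \r.
    static List<String> withNextLine(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNextLine()) {
                lines.add(sc.nextLine());
            }
        }
        return lines;
    }

    // With "\n" as the only delimiter, U+2029 stays inside the token.
    static List<String> withNewlineDelimiter(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner sc = new Scanner(text).useDelimiter("\n")) {
            while (sc.hasNext()) {
                lines.add(sc.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(withNextLine(SAMPLE).size());         // 2
        System.out.println(withNewlineDelimiter(SAMPLE).size()); // 1
    }
}
```

The single token returned by the second method still contains the U+2029 character, so you may want to replace or strip it before printing.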
Scott
  • Thanks, but the next() method throws a NoSuchElementException. Any ideas? – kolonel Apr 04 '14 at 04:04
  • On the same line? Or is this from a different part of the file? – Scott Apr 04 '14 at 04:08
  • I am not sure unless you have closed the underlying stream - http://stackoverflow.com/questions/13042008/java-util-nosuchelementexception-scanner-reading-user-input – Scott Apr 04 '14 at 04:19
1

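I'm fighting with a similar problem right now. In my case Scanner stops at a non-ASCII character and behaves as if the file had ended, which is why hasNext() or hasNextLine() returns false. Scanner does handle non-ASCII text in general, but it swallows any IOException from the underlying reader (for example, bytes that are not valid in the given charset) and treats it as end of input; ioException() returns the swallowed exception, if there was one. As a workaround, you can read the file with a BufferedReader instead: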

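// Needs java.io.* and java.nio.charset.StandardCharsets.
// Decodes as UTF-8 explicitly; FileReader would use the platform default charset.
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}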
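
As a rough check that this helps with the string from the question, the sketch below feeds an in-memory sample through readLine(). The class name, method name and shortened sample text are hypothetical. readLine() only treats \n, \r and \r\n as line terminators, so the U+2029 paragraph separator stays inside the line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLineDemo {
    // Hypothetical sample: U+2029 (PARAGRAPH SEPARATOR) in the middle, "\n" at the end.
    static final String SAMPLE = "20-hour course.\u2029Results: 4.8%\n";

    // Collects the lines that BufferedReader.readLine() produces for the given text.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = readLines(SAMPLE);
        System.out.println(lines.size());                      // 1
        System.out.println(lines.get(0).indexOf('\u2029') >= 0); // true
    }
}
```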