
I'm reading a file with the following piece of code:

Scanner in = new Scanner(new File(fileName));
while (in.hasNextLine()) {
    String[] line = in.nextLine().trim().split("[ \t]");
    // ...
}

When I open the file with vim, some lines begin with a special character:

[screenshot of the character as displayed in vim]

but the Java code can't read these lines. When it reaches them, it thinks it has hit the end of the file and hasNextLine() returns false!

EDIT: here is a hex dump of the problematic line:

0000000: e280 9c20 302e 3230 3133 3220 302e 3231  ... 0.20132 0.21
0000010: 3431 392d 302e 3034 0a                   419-0.04.
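
For anyone trying to reproduce this, the same bytes can be printed from Java without relying on xxd; a minimal sketch (the class name ByteDump and the whole-file read are just for illustration):

import java.io.IOException;
import java.nio.file.*;

class ByteDump {
  public static void main(String[] args) throws IOException {
    // Read the raw bytes so that no charset decoding can interfere
    byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
    for (byte b : bytes) {
      System.out.printf("%02x ", b & 0xff); // values >= 80 are non-ASCII
    }
    System.out.println();
  }
}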

  • @MaartenBodewes The zero byte is not an end-of-file byte. No Reader or Scanner will stop when reading it. – VGR Oct 01 '18 at 20:00
  • My guess is that one or more of the special characters are bytes with their highest bit set, which constitutes an invalid UTF-8 byte sequence. (`new Scanner(new File(fileName))` will read the file using the system’s default charset, which is UTF-8 on most non-Windows systems.) Print the value of `in.ioException()` after your loop finishes. – VGR Oct 01 '18 at 20:07
  • I'm unable to reproduce this issue as posted, or to reproduce what vim is showing. Can you please include a hex dump or base64 encoded snippet of the file content instead of a screenshot? – that other guy Oct 01 '18 at 20:19
  • @VGR actually no exception occurs if I use hasNextLine(), but if I try to read the line without checking its existence I will get the following exception: java.util.NoSuchElementException: No line found – ayyoob imani Oct 01 '18 at 20:21
  • @that other guy: I added a hex dump of the problematic line to the question! – ayyoob imani Oct 01 '18 at 20:33

1 Answer


@VGR got it right.

tl;dr: Use Scanner in = new Scanner(new File(fileName), "ISO-8859-1");

What appears to be happening is that:

  • Your file is not valid UTF-8 due to that lone 0x9C character.
  • The Scanner is reading the file as UTF-8, since this is the system default.
  • The underlying libraries throw a MalformedInputException.
  • The Scanner catches and hides it (a well-meaning but misguided design decision).
  • It starts reporting that it has no more lines.
  • You won't know anything has gone wrong unless you actually ask the Scanner.

Here's an MCVE:

import java.io.*;
import java.util.*;

class Test {
  public static void main(String[] args) throws Exception {
    // args[0] is the file to read, args[1] is the charset to decode it with
    Scanner in = new Scanner(new File(args[0]), args[1]);
    while (in.hasNextLine()) {
      String line = in.nextLine();
      System.out.println("Line: " + line);
    }
    // Scanner swallows IOExceptions; ioException() is how you retrieve them
    System.out.println("Exception if any: " + in.ioException());
  }
}

Here's an example of a normal invocation:

$ printf 'Hello\nWorld\n' > myfile && java Test myfile UTF-8
Line: Hello
Line: World
Exception if any: null

Here's what you're seeing (except that you don't retrieve and show the hidden exception). Notice in particular that no lines are shown:

$ printf 'Hello\nWorld \234\n' > myfile && java Test myfile UTF-8
Exception if any: java.nio.charset.MalformedInputException: Input length = 1

And here it is when decoded as ISO-8859-1, an encoding in which every byte sequence is valid (even though 0x9C has no assigned character and therefore doesn't show up in a terminal):

$ printf 'Hello\nWorld \234\n' > myfile && java Test myfile ISO-8859-1
Line: Hello
Line: World
Exception if any: null

If you're only interested in ASCII data and don't have any UTF-8 strings, you can simply ask the Scanner to use ISO-8859-1 by passing it as a second parameter to the Scanner constructor:

Scanner in = new Scanner(new File(fileName), "ISO-8859-1");
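
If you'd rather have the bad byte fail loudly instead of silently losing lines, one alternative sketch (not using Scanner; the class name StrictRead is just for illustration) is Files.newBufferedReader, whose UTF-8 decoder reports malformed input instead of hiding it:

import java.io.*;
import java.nio.charset.MalformedInputException;
import java.nio.file.*;

class StrictRead {
  public static void main(String[] args) throws IOException {
    // Files.newBufferedReader decodes strictly (CodingErrorAction.REPORT),
    // so a malformed byte throws instead of being swallowed as in Scanner
    try (BufferedReader r = Files.newBufferedReader(Paths.get(args[0]))) {
      String line;
      while ((line = r.readLine()) != null) {
        System.out.println("Line: " + line);
      }
    } catch (MalformedInputException e) {
      System.err.println("File is not valid UTF-8: " + e);
    }
  }
}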
– that other guy