So I'm reading a plain text file in Java, and I'd like to identify which lines start with "abc". I did the following:

Charset charset = StandardCharsets.UTF_8;
BufferedReader br = Files.newBufferedReader(file.toAbsolutePath(), charset);
String line;
while ((line = br.readLine()) != null) {
   if (line.startsWith("abc")) {
       // Do something
   }
}

But if the first line of the file is "abcd", it won't match. By debugging I've found out that the first character is a 0 (non-printable character), and because of this it won't match. Why is that so? How could I robustly identify which lines start with "abc"?

EDIT: perhaps I should point out that I'm creating the file using Notepad.

Thiago

1 Answer

Windows has a few problems with UTF-8, and as a result many Windows programs prepend the UTF-8 BOM (Byte Order Mark) to the files they write.

If my guess is correct, the first three bytes would then be (in hexadecimal): 0xef, 0xbb, 0xbf.

Given that, for instance, Excel creates UTF-8 CSV files with a BOM prefix, I wouldn't be surprised at all if Notepad did as well...

edit: not surprisingly, it seems this is the case: see here.
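
As for matching robustly: one option is to drop a leading BOM from the first line before calling startsWith. Below is a minimal sketch of that idea, not a definitive implementation. It assumes the file really is UTF-8, in which case Java's decoder hands the three BOM bytes back as the single char U+FEFF. The class name BomAwareLines and the helpers stripBom and linesStartingWith are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BomAwareLines {
    // The UTF-8 bytes EF BB BF decode to this single char.
    private static final char BOM = '\uFEFF';

    /** Hypothetical helper: drops one leading BOM char, if present. */
    public static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    /** Hypothetical helper: collects the lines of a UTF-8 file that start with prefix. */
    public static List<String> linesStartingWith(Path file, String prefix) throws IOException {
        List<String> matches = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    // A BOM can only appear at the very start of the file.
                    line = stripBom(line);
                    first = false;
                }
                if (line.startsWith(prefix)) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Build a small file the way Notepad would: BOM first, then the text.
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] text = "abcd\nxyz\nabc\n".getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[bom.length + text.length];
        System.arraycopy(bom, 0, all, 0, bom.length);
        System.arraycopy(text, 0, all, bom.length, text.length);
        Files.write(tmp, all);

        System.out.println(linesStartingWith(tmp, "abc")); // prints [abcd, abc]
        Files.delete(tmp);
    }
}
```

This only covers the UTF-8 case. A file saved as UTF-16 would need a different charset passed to Files.newBufferedReader.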

fge
  • Correct! It was a question mark (and not 0 as I've stated). Thanks a lot! – Thiago Jun 08 '13 at 04:38
  • But he is reading using a UTF-8 Reader, so the BOM would be a single `char` with value 0xFEFF. If it was then output to (for example) a LATIN-1 console, he would see a `?`. – Stephen C Jun 08 '13 at 04:38
  • @StephenC uh, yeah, I meant the first three bytes, sorry... Fixed – fge Jun 08 '13 at 04:40
  • @Thiago if you have to process a lot of files issued from Windows, you may also encounter UTF-16... PowerShell writes files with this encoding by default! – fge Jun 08 '13 at 04:45
  • @StephenC note though that this is a _UTF-8_ BOM, not UTF-16 or UTF-32 – fge Jun 08 '13 at 04:50
  • @fge - Yes. And the UTF-8 BOM mark translates to 0xFEFF when you feed it through a UTF-8 to Unicode (or UTF-16) translator... like I said. – Stephen C Jun 08 '13 at 07:08