
So, I have an issue that really bothers me. I have a simple parser that I wrote in Java. Here is the relevant piece of code:

while ((line = br.readLine()) != null)
{
    String[] splitted = line.split(SPLITTER);
    int docNum = Integer.parseInt(splitted[0].trim());
    // do something
}

The input file is a CSV file, with the first entry of the file being an integer. When I start parsing, I immediately get this exception:

Exception in thread "main" java.lang.NumberFormatException: For input string: "1"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at dipl.parser.TableParser.parse(TableParser.java:50)
at dipl.parser.DocumentParser.main(DocumentParser.java:87)

I checked the file; it indeed has 1 as its first value (no other characters are in that field), but I still get the message. I think it may be because of the file encoding: it is UTF-8 with Unix line endings, and the program runs on Ubuntu 14.04. Any suggestions on where to look for the problem are welcome.

Milan Todorovic

1 Answer


You have a BOM in front of that number; if I copy what looks like "1" in your question and paste it into vim, I see that you have a FE FF (i.e., a BOM) in front of it. From that link:

The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format.

So that's the issue: consume the file with the appropriate reader for the transformation format (UTF-8, UTF-16 big-endian, UTF-16 little-endian, etc.) the file is encoded with. See also this question and its answers for more about reading Unicode files in Java.
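For example, here is a minimal sketch of that approach (not your actual code: the file name input.csv, the UTF-8 charset, and the comma delimiter are assumptions) that opens the file with an explicit charset and drops a leading U+FEFF if one survives decoding:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BomAwareParser {
    private static final String SPLITTER = ",";   // assumed delimiter

    public static void main(String[] args) throws IOException {
        // Decode with an explicit charset rather than the platform default.
        try (BufferedReader br = Files.newBufferedReader(
                Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    // After decoding, a BOM shows up as the single character U+FEFF
                    // at the start of the first line; drop it before parsing.
                    if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
                        line = line.substring(1);
                    }
                    first = false;
                }
                String[] splitted = line.split(SPLITTER);
                int docNum = Integer.parseInt(splitted[0].trim());
                // do something with docNum
            }
        }
    }
}

If the file might arrive in other encodings (the UTF-16 variants and so on), a BOM-detecting wrapper such as Apache Commons IO's BOMInputStream is another option: it detects a leading BOM, excludes it from the stream, and can tell you which charset it implies.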

T.J. Crowder
  • @Doval: **Thank you,** I was absolutely wrong to say it was a UTF-8 BOM, and you're quite right that on the wire, the BOM for UTF-8 is EF BB BF. But what we're looking at is the *end result* of reading the file and then seeing the output in the error message. The file might be in any transformation; all BOMs end up being FE FF *once read*. – T.J. Crowder Sep 26 '16 at 18:12
  • But if it was read *raw*, then...oh, I don't know. :-) Could well have been UTF-16. :-) It'll all depend on how the file was read into the stream. – T.J. Crowder Sep 26 '16 at 18:29
  • "all BOMs end up being FE FF once read" - Not quite. All BOMs end up being U+FEFF (which is not the same as 0xFE 0xFF, since it's a code point rather than a sequence of bytes) once *decoded*. Before decoding, all you have is bytes, which may be in any encoding that can represent Unicode characters (mostly UTF-8 and UTF-16, but others exist). – Kevin Sep 26 '16 at 19:59
  • @Kevin: Yes, that's what I meant. – T.J. Crowder Sep 27 '16 at 02:32
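To make the bytes-versus-code-point distinction from these comments concrete, here is a small self-contained demonstration (the byte values are hypothetical, not taken from the asker's file):

import java.nio.charset.StandardCharsets;

public class BomDemo {
    public static void main(String[] args) {
        // On the wire, a UTF-8 BOM is the byte sequence EF BB BF ...
        byte[] raw = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) '1' };

        // ... but once decoded it becomes the single code point U+FEFF.
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(decoded.length());        // 2: U+FEFF followed by '1'
        System.out.println((int) decoded.charAt(0)); // 65279, i.e. U+FEFF

        // Integer.parseInt(decoded) would throw
        // NumberFormatException: For input string: "1"
        // because the invisible BOM character is still in the string.

        // Stripping the decoded BOM fixes the parse:
        System.out.println(Integer.parseInt(decoded.replace("\uFEFF", ""))); // 1
    }
}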