
I'm trying to parse a file into a Map. The line I'm trying to parse (as printed to the log with System.out) is:

10 przysuń hotel o 90 metrów

with each word separated by a tab character (\t).

The file is UTF-8 encoded.

Here's my method:

private void readFile() {
    try {
        if (transcriptFile == null)
            transcriptFile = new File(transcriptPath);

        lines = Files.readAllLines(transcriptFile.toPath());
        for (String s : lines) {
            if (!s.isEmpty()) {
                List<String> parts = Arrays.asList(s.split("\t"));
                System.out.println(parts);

                int id = Integer.parseInt(parts.get(0).trim());
                parts.remove(0);
                String text = String.join(" ",parts);
                map.put(id,text);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

And I'm getting this exception:

[10, przysuń, hotel, o, 90, metrów ]
java.lang.NumberFormatException: For input string: "10"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at Controller.readFile(Controller.java:143)
at Controller.access$000(Controller.java:29)
at Controller$SpeechTask.call(Controller.java:202)
at Controller$SpeechTask.call(Controller.java:154)
at javafx.concurrent.Task$TaskCallable.call(Task.java:1423)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

I see no reason why this would not be parsable.

Asalas77
  • Perhaps there are some invisible special characters in the string? What if you strip all non-digits first, try `int id = Integer.parseInt(parts.get(0).replaceAll("\\D+", ""));` – janos May 14 '17 at 14:15
  • Potentially a character encoding issue. – KevinO May 14 '17 at 14:17
  • Possible duplicate of [What is a NumberFormatException and how can I fix it?](http://stackoverflow.com/questions/39849984/what-is-a-numberformatexception-and-how-can-i-fix-it) – xenteros May 19 '17 at 12:52
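
One way to check the "invisible character" theory from the comments is to print the code points of the first token. This is only a minimal sketch: the class name, the `describe` helper, and the sample token (a U+FEFF glued to `10`) are assumptions for illustration, not taken from the asker's file.

```java
class ShowCodePoints {
    // Renders every code point of s as U+XXXX so invisible characters become visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed sample: a byte order mark (U+FEFF) directly in front of the digits.
        String token = "\uFEFF10";

        // Prints: U+FEFF U+0031 U+0030
        System.out.println(describe(token));

        // The replaceAll suggestion from the comments drops every non-digit,
        // including U+FEFF, so this prints: 10
        System.out.println(token.replaceAll("\\D+", ""));
    }
}
```

If the first code point turns out to be something other than a digit, `Integer.parseInt` will reject the token even though it looks like `10` in the log.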

3 Answers

1

Your input file may contain a byte order mark (BOM), an invisible character located immediately before the characters 10. Try the solution from this post or Apache Commons IO's BOMInputStream.

Non-programmatically, you could use Notepad++'s "Encode in UTF-8 without BOM" feature and save the input file.
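
If you would rather handle it in code without an extra library, a hand-rolled check is enough for the UTF-8 case. The sketch below is an assumption-laden illustration, not the linked post's solution: the `stripBom` helper and the temporary file are hypothetical, and the sample line uses ASCII words only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class BomStripDemo {
    // Removes a single leading U+FEFF if present; otherwise returns the line unchanged.
    static String stripBom(String line) {
        return line.startsWith("\uFEFF") ? line.substring(1) : line;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the transcript file. Encoding U+FEFF as UTF-8
        // writes the bytes EF BB BF, which is what a "UTF-8 with BOM" save produces.
        Path file = Files.createTempFile("transcript", ".txt");
        Files.write(file, "\uFEFF10\thotel\to\t90\n".getBytes(StandardCharsets.UTF_8));

        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);

        // Java's UTF-8 decoder keeps the BOM, so it ends up inside the first line.
        String first = stripBom(lines.get(0));
        int id = Integer.parseInt(first.split("\t")[0].trim());
        System.out.println(id); // prints 10

        Files.delete(file);
    }
}
```

Only the first line of the file can carry the BOM, so in a loop over `lines` it is sufficient to apply the helper to `lines.get(0)`.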

Reimeus
0

Can you replace the following line

List<String> parts = Arrays.asList(s.split("\t"));

with this and check again:

List<String> parts = Arrays.asList(s.replaceAll("\\s+", ",").split(","));

Then let us know whether the same exception still occurs.

0

This seems to be a character encoding issue: Notepad saves some additional characters at the start of the file when the encoding is set to UTF-8.

When I tried this, parts.get(0).trim() returned ?10 rather than 10, hence the NumberFormatException. If the 10 were the second field instead, parts.get(1).trim() would return 10 and no NumberFormatException would occur.

The following question explains this issue: Reading strange unicode character in Java?

Furthermore, Arrays.asList returns a fixed-size list, so parts.remove(0) would throw an UnsupportedOperationException even if Integer.parseInt succeeded.
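
The fixed-size behaviour is easy to reproduce in isolation. The following is a minimal sketch with a hypothetical class and helper name (`MutablePartsDemo`, `textAfterId`) and an ASCII-only sample line; copying into an `ArrayList` is one possible workaround, not the only one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MutablePartsDemo {
    // Joins everything after the first tab-separated field with single spaces.
    static String textAfterId(String line) {
        // Copy into an ArrayList so that elements can be removed.
        List<String> parts = new ArrayList<>(Arrays.asList(line.split("\t")));
        parts.remove(0);
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        String line = "10\thotel\to\t90"; // assumed sample line

        try {
            // The list returned by Arrays.asList is backed by the array and has a fixed size.
            Arrays.asList(line.split("\t")).remove(0);
        } catch (UnsupportedOperationException e) {
            System.out.println("fixed-size list rejected remove(0)");
        }

        System.out.println(textAfterId(line)); // prints: hotel o 90
    }
}
```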

friendlyBug
  • Regarding your last point, I changed it to `parts = new ArrayList<>(Arrays.asList(s.split("\t")));` so removing an element should not be an issue. – Asalas77 May 14 '17 at 19:54