Java IO fails to read text file

Question

When I try to read thesaurus.txt, the first thing it reads is "ÿþ ", although the first entry in the file is "<pat>a cappella". What could be causing this?

    File file = new File("thesaurus.txt");
    Scanner scan;
    try {
        scan = new Scanner(file);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
        scan = null;
    }
    String entry;
    ArrayList<String> thes = new ArrayList<String>();
    while(scan.hasNextLine())
    {
        entry = scan.nextLine();
        // compare string contents, not references
        if(!entry.isEmpty())
        {
             thes.add(entry);
        }
    }
    return thes;

I tested the code and it works for me. My only guess would be that the file has a different character encoding than expected. — Johan Prins, Feb 20 '15 at 21:38
I agree with Johan. Sounds like an encoding issue. Try forcing a specific encoding. If you don't know what character encoding is, google it. "ASCII" and "UTF-8" are some examples. — Russell Uhl, Feb 20 '15 at 21:39

score 3 · Answer 1 · answered Feb 20 '15 at 22:55

Your input file is probably a UTF-16 (LE) file that starts with a byte order mark (BOM).

If you view this file as if it were ISO 8859-1, you'll see those two characters: ÿþ. They have the codes FF and FE in that character encoding, which is exactly the byte sequence you would expect when a little-endian UTF-16 BOM is present.
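To see the effect in isolation, here is a small sketch that decodes the same four bytes twice, once as ISO 8859-1 and once as UTF-16. The class and method names (`BomDemo`, `asLatin1`, `asUtf16`) are made up for illustration, and the sample bytes are an assumption about what the start of such a file could look like.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // ISO 8859-1 maps every byte to exactly one character, so the BOM bytes show up as text.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // The UTF-16 decoder uses a leading BOM to pick the byte order and does not return it as text.
    static String asUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Assumed sample: FF FE (little-endian BOM) followed by "a" encoded as 61 00.
        byte[] bytes = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};

        // Print code points rather than the characters, so the console encoding does not matter.
        for (char c : asLatin1(bytes).toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+00FF U+00FE U+0061 U+0000
        }
        System.out.println();

        System.out.println(asUtf16(bytes)); // a
    }
}
```

U+00FF and U+00FE are ÿ and þ, which matches what the question reports.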

You should explicitly specify the character encoding when reading the file, instead of relying on the default character encoding of your system:

    scan = new Scanner(file, "UTF-16");
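Put together, the reading loop could look roughly like the sketch below. It assumes the file really is UTF-16 with a BOM. `ThesaurusReader`, `readEntries`, and `writeSample` are hypothetical names, and the second sample entry is invented just so there is something to read. The helper writes a little-endian file with a BOM so the example can run on its own.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ThesaurusReader {
    // Reads all non-empty lines, decoding as UTF-16 (the BOM decides the byte order).
    static List<String> readEntries(File file) {
        List<String> entries = new ArrayList<>();
        try (Scanner scan = new Scanner(file, "UTF-16")) {
            while (scan.hasNextLine()) {
                String entry = scan.nextLine();
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        } catch (FileNotFoundException e) {
            // Fail loudly instead of continuing with a null Scanner.
            throw new UncheckedIOException(e);
        }
        return entries;
    }

    // Hypothetical helper: writes a temporary UTF-16LE file with a BOM and made-up entries.
    static File writeSample() {
        try {
            File file = File.createTempFile("thesaurus", ".txt");
            file.deleteOnExit();
            byte[] text = "<pat>a cappella\n\n<pat>abacus\n".getBytes(StandardCharsets.UTF_16LE);
            byte[] bytes = new byte[text.length + 2];
            bytes[0] = (byte) 0xFF; // little-endian BOM, first byte
            bytes[1] = (byte) 0xFE; // little-endian BOM, second byte
            System.arraycopy(text, 0, bytes, 2, text.length);
            Files.write(file.toPath(), bytes);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readEntries(writeSample()));
    }
}
```

If the file turns out not to be UTF-16, the same constructor accepts other charset names such as "UTF-8".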
