When I try to read thesaurus.txt, the first line comes back as "ÿþ ", although the first entry in the file is "<pat>a cappella". What could be causing this?

    File file = new File("thesaurus.txt");
    Scanner scan;
    try {
        scan = new Scanner(file);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
        scan = null;
    }
    String entry;
    ArrayList<String> thes = new ArrayList<String>();
    while(scan.hasNext())
    {
        entry = scan.nextLine();
        if(entry != "")
        {
             thes.add(entry);
        }
    }
    return thes;
skaffman
user383
  • I tested the code and it works for me. My only guess would be that the file has a different character encoding than expected. – Johan Prins Feb 20 '15 at 21:38
  • I agree with Johan. Sounds like an encoding issue. Try forcing a specific encoding. If you don't know what character encoding is, google it. "ASCII" and "UTF-8" are some examples. – Russell Uhl Feb 20 '15 at 21:39
  • http://stackoverflow.com/q/22763251/1544337 –  Feb 20 '15 at 22:04
  • Switching to ASCII worked, thank you. – user383 Feb 23 '15 at 14:58

1 Answer

Your input file is probably a UTF-16 (LE) file that starts with a byte order mark (BOM).

If you read this file as if it were ISO 8859-1, you'll see those two characters, ÿþ, which have the codes FF and FE in that encoding. Those are exactly the bytes you would expect at the start of a file with a little-endian UTF-16 BOM.
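
To make the byte-level picture concrete, here is a minimal sketch that decodes the same four bytes two ways. The class name `BomDemo`, its helper methods, and the sample bytes (a BOM followed by the letter "a") are made up for illustration and are not taken from the question's file.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Hypothetical sample: the UTF-16 LE BOM (FF FE) followed by "a" (61 00).
    static final byte[] SAMPLE = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};

    // Decoding as ISO 8859-1 maps every byte to one character,
    // so the BOM bytes show up as the two characters U+00FF and U+00FE.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Java's "UTF-16" charset uses a leading BOM to pick the byte order
    // and does not include the BOM in the decoded text.
    static String decodeAsUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsLatin1(SAMPLE).startsWith("\u00FF\u00FE"));
        System.out.println(decodeAsUtf16(SAMPLE));
    }
}
```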

You should explicitly specify the character encoding when reading the file, instead of relying on the default character encoding of your system:

    scan = new Scanner(file, "UTF-16");
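
Put together with the reading loop from the question, this could look roughly like the sketch below. The class `ThesaurusReader` and the method `readEntries` are hypothetical names, and it assumes the file really is UTF-16 with a BOM. Besides the charset, it makes three changes to the original loop that are my own suggestions rather than part of the fix: it checks `hasNextLine()` to match `nextLine()`, it tests for empty lines with `isEmpty()` instead of `!= ""` (which compares references, not contents), and it lets the `FileNotFoundException` propagate instead of continuing with a null scanner.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ThesaurusReader {
    // Reads all non-empty lines from a UTF-16 encoded file.
    static List<String> readEntries(File file) throws FileNotFoundException {
        List<String> entries = new ArrayList<>();
        // Name the charset explicitly instead of relying on the platform default.
        // try-with-resources closes the scanner (and the file) when done.
        try (Scanner scan = new Scanner(file, "UTF-16")) {
            while (scan.hasNextLine()) {
                String entry = scan.nextLine();
                if (!entry.isEmpty()) {
                    entries.add(entry);
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws FileNotFoundException {
        for (String entry : readEntries(new File("thesaurus.txt"))) {
            System.out.println(entry);
        }
    }
}
```

If the file turns out to be in a different encoding, only the charset name passed to `Scanner` needs to change.
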
Jesper