I'm experiencing an interesting issue at the moment.

I'm trying to read this file in Java. It contains the 1000 most common English words in alphabetical order:

http://www.file-upload.net/download-6679295/basicVocabulary.txt.html

This is a snippet at the beginning of the file:

a
able
about
above
according
account
across
act
action
added
afraid
after

My problem is that, although I seem to be reading the text file correctly, the first line cannot be found in my result list later on. In this case that is the word "a", since it is the first entry in the file.

To reproduce the problem, please try this sample code with the text file above (don't forget to update the file path). I have added the console output I get as comments.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class MyWrongBehaviour {

public static void main(String[] args){
    MyWrongBehaviour wrong = new MyWrongBehaviour(); 

    List<String> list = wrong.loadLanguageFile(); 

    System.out.println("size of the list: " + list.size()); //Answer is 1000, that's the correct size

    for(String s : list){
        System.out.println(s); // "a" will appear, so it is somehow included
    }

    if(list.contains("a")){
        System.out.println("found \"a\""); // doesn't get written on the console, can't find it
    }

    for(String s : list){
        if(s.equals("a")){
            System.out.println("found \"a\""); // never gets written, can't find it
        }
    }


}

private List<String> loadLanguageFile() {
    List<String> result = null;
    try (InputStream vocIn = getClass().getResourceAsStream(
            "/test/basicVocabulary.txt")) {

        if (vocIn == null) {
            throw new IllegalStateException(
                    "InputStream for the basic vocabulary must not be null");
        }

        BufferedReader in = new BufferedReader(new InputStreamReader(vocIn,
                "UTF-8"));

        String zeile = null;

        result = new ArrayList<>();
        while ((zeile = in.readLine()) != null) {
            result.add(zeile.trim());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return result;
}


}

Does anyone have an idea why this is happening and how I can fix it? I suspect either a charset error, although I saved the file as UTF-8, or an invisible character that corrupts the file, but I don't know how to identify it.

Btw: I used a HashSet before, but with a Set the first line didn't even get added. Now it gets added, but I can't find it.

Thanks for every answer and thought you share with me.

Waylander
  • Sorry man.. But that link is restricted in my office.. Still will go through your code to see what may be the problem.,. – Rohit Jain Oct 10 '12 at 13:35
  • I uploaded it again, try this link http://www.file-upload.net/download-6679295/basicVocabulary.txt.html – Waylander Oct 10 '12 at 13:36
  • This might be a [byte order mark](http://en.wikipedia.org/wiki/Byte_order_mark) issue. You might have a `U+FEFF` character at the start of your file. See the link for more information. – Rohit Jain Oct 10 '12 at 13:39
  • And I think you have got a solution for this in an answer below by Jon Skeet.. :) – Rohit Jain Oct 10 '12 at 13:40
  • More discussion of this problem here: http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java – ChrisCantrell Oct 10 '12 at 13:48

1 Answer

The file starts with a byte-order mark which indicates that it's UTF-8, so the first line is actually equivalent to "\ufeffa" (i.e. two characters, U+FEFF and then 'a'), which then isn't equal to "a".
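
To illustrate the point, here is a minimal sketch that builds the first line by hand instead of reading the file. The class name `BomEqualityDemo` is made up for this example. It also shows why the `trim()` call in the question does not help: `trim()` only removes characters up to U+0020, and U+FEFF is not one of them.

```java
class BomEqualityDemo {

    public static void main(String[] args) {
        // What readLine() effectively returns for the first line of the file:
        // the character U+FEFF followed by 'a'.
        String firstLine = "\uFEFFa";

        System.out.println(firstLine.length());           // 2
        System.out.println(firstLine.equals("a"));        // false
        // trim() leaves U+FEFF in place, so this is still false.
        System.out.println(firstLine.trim().equals("a")); // false
        // Removing the mark explicitly makes the comparison succeed.
        System.out.println(firstLine.replace("\uFEFF", "").equals("a")); // true
    }
}
```

The same reasoning explains why `list.contains("a")` fails: `contains` relies on `equals`, and the stored string has two characters.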

One way of stripping this is just to use:

result.add(zeile.trim().replace("\ufeff", ""));

After that change, your code works as expected. There may be a better way of removing byte-order marks in Java, but I don't know it offhand.
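
As one possible variation (a sketch, not necessarily the better way mentioned above), the mark can be removed only where it is expected, at the start of the first line, rather than on every line. The class `BomAwareLoader` and the method `readLines` are hypothetical names, and the input is an in-memory stream standing in for the resource from the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class BomAwareLoader {

    // Reads all lines as UTF-8 and drops a leading U+FEFF from the first line only.
    static List<String> readLines(InputStream in) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // skip the byte-order mark
                }
                first = false;
                result.add(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated file content: UTF-8 bytes that begin with a byte-order mark.
        byte[] data = "\uFEFFa\nable\nabout\n".getBytes(StandardCharsets.UTF_8);
        List<String> words = readLines(new ByteArrayInputStream(data));
        System.out.println(words.contains("a")); // true
    }
}
```

Checking only the first line avoids touching a U+FEFF that appears elsewhere in the text, where it would be a zero-width no-break space rather than a byte-order mark.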

Jon Skeet
  • Thanks, this did the trick for me. Do you maybe also know a way to make these BOMs visible in a file or in console output? I'm just curious. – Waylander Oct 11 '12 at 10:32
  • @Waylander: I saw it when I opened it with my text editor - basically you want to use something sufficiently primitive that it *doesn't* strip it automatically :) You can always look at the file in a binary editor with hex output etc. – Jon Skeet Oct 11 '12 at 10:43
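
On the question of making the mark visible from Java itself, a rough sketch is to print each character as its code point, or to check the first three raw bytes for the UTF-8 encoding of U+FEFF (EF BB BF). The class `BomInspector` and both method names are invented for this example.

```java
import java.nio.charset.StandardCharsets;

class BomInspector {

    // Renders every code point as U+XXXX so that invisible characters show up.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // True if the raw bytes begin with EF BB BF, the UTF-8 form of U+FEFF.
    static boolean startsWithUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        String firstLine = "\uFEFFa";
        System.out.println(describe(firstLine)); // U+FEFF U+0061

        byte[] raw = firstLine.getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithUtf8Bom(raw)); // true
    }
}
```

Calling `describe` on the first element of the list from the question should reveal the extra character that the plain `println` hides.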