Comparing words with Scandinavian characters

Question

My program uses the Scanner class to read a few words from a file and compares them to the user's input. My editor is NetBeans and my OS is Windows 7. I first ran the program in NetBeans and had no problems. When I ran it in the command prompt, Scandinavian characters (ä, ö, å, Ä, Ö, Å) didn't display correctly. I tried giving Scanner different charset parameters, such as ISO-8859-1, but that didn't help. Finally, I gave it UTF-8 and the characters displayed correctly. But then I got a new problem: I use the equals method to compare two words, and now it doesn't "work". Even though the words should be equal, equals returns false. If I don't specify any character set for Scanner, the program works in NetBeans but not in the command prompt. What can I do, and why doesn't equals work? Should I write my own comparison method?

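import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Method of a class that has a map field named words (line number -> word)
public void readingWordsFromFile(String textfile) {
    File f = new File("WordLists\\" + textfile + ".txt");

    // try-with-resources closes the Scanner even if reading fails
    try (Scanner l = new Scanner(f, "UTF-8")) {
        int i = 1;
        // hasNextLine() is the matching check for nextLine()
        while (l.hasNextLine()) {
            String temp = l.nextLine();
            words.put(i, temp);
            i++;
        }
    } catch (FileNotFoundException e) {
        // Report the problem instead of silently swallowing it
        e.printStackTrace();
    }
}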

Edit: "Solved". The cause wasn't the character set itself. The files contained a BOM because I had accidentally saved them with Notepad. Now I use Notepad++ again and everything is fine. : )
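
For anyone who cannot re-save the files: Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF, so the first word read from the file silently differs from the typed one. Below is a minimal sketch of stripping it after reading. The class name BomAwareReader and its methods are made up for illustration and are not part of the original program.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original post.
final class BomAwareReader {
    private static final String BOM = "\uFEFF";

    /** Removes a single leading U+FEFF, if present. */
    static String stripBom(String s) {
        return s.startsWith(BOM) ? s.substring(1) : s;
    }

    /** Reads all lines as UTF-8 and strips a BOM from the first line only. */
    static List<String> readWords(Path file) throws IOException {
        List<String> lines = new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        if (!lines.isEmpty()) {
            lines.set(0, stripBom(lines.get(0)));
        }
        return lines;
    }

    public static void main(String[] args) {
        // What the decoder yields for the first line of a BOM-prefixed file ("äiti")
        String fromFile = "\uFEFF\u00E4iti";
        String typed = "\u00E4iti";
        System.out.println(fromFile.equals(typed));            // false
        System.out.println(stripBom(fromFile).equals(typed));  // true
    }
}
```

Only the first line can carry the BOM, which is why the sketch leaves the other lines alone.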

Do you mean the Windows command prompt? If so, the Windows cmd [is not UTF-8 by default](http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how/388500#388500). The NetBeans console is UTF-8, so it works properly. — BackSlash, Aug 26 '14 at 07:56
You should provide sample code demonstrating how you use `scanner` and encodings — Andremoniy, Aug 26 '14 at 07:56
BTW, what does the command `chcp` print in the Windows command prompt? — Andremoniy, Aug 26 '14 at 07:58

score 0 · Answer 1 · answered Aug 26 '14 at 07:59

equals will not work when the two strings were decoded with different encodings: the same bytes then produce different characters, so the resulting strings are completely different pieces of data.
You should set the proper encoding for the Scanner when running in the Windows cmd. Run the chcp command in cmd to see which code page it uses.
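
To illustrate the point, here is a small sketch that decodes the same UTF-8 bytes once correctly and once as ISO-8859-1. The class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not from the original post.
final class DecodingMismatch {
    /** Encodes text as UTF-8 and decodes the bytes with another charset. */
    static String decodeUtf8BytesAs(String text, Charset other) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, other);
    }

    public static void main(String[] args) {
        String word = "\u00E4iti"; // "äiti"
        String right = decodeUtf8BytesAs(word, StandardCharsets.UTF_8);
        String wrong = decodeUtf8BytesAs(word, StandardCharsets.ISO_8859_1);

        System.out.println(word.equals(right)); // true
        System.out.println(word.equals(wrong)); // false
        System.out.println(wrong.length());     // 5: the two UTF-8 bytes of "ä" became two chars
    }
}
```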

score 0 · Answer 2 · edited May 23 '17 at 10:25


The Windows cmd is not UTF-8 by default. The NetBeans console is UTF-8, so it works properly.

In fact, if you type chcp in the console and press enter, you should see

Current active code table is: 850

Code page 850 is the DOS "Latin 1" (Western European) OEM code page. It is not the same as ISO-8859-1 and not UTF-8.

answered Aug 26 '14 at 08:00 by BackSlash · edited May 23 '17 at 10:25 by Community

score 0 · Answer 3 · answered Aug 26 '14 at 08:09

Whenever possible use UTF-8; one can often pass StandardCharsets.UTF_8. For Swedish, the mentioned ISO-8859-1 already covers å, ä and ö; ISO-8859-4 (Latin-4, North European) contains them too but offers no advantage for Swedish text.

However, one problem with Unicode is that an accented letter can be represented in two ways: as a single code point (the letter including its accent), or as separate code points, an ASCII letter followed by a "combining diacritical mark" (the accent). The two forms look identical but are not equal under equals. To normalise text before comparing, one can use java.text.Normalizer.
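
A minimal sketch of such a comparison, normalising both sides to NFC first. The helper name equalsNormalized is made up for this example.

```java
import java.text.Normalizer;

// Hypothetical helper class, not from the original post.
final class NormalizedCompare {
    /** Compares two strings after normalising both to NFC (composed form). */
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00E4";     // "ä" as one code point
        String decomposed = "a\u0308";  // "a" followed by COMBINING DIAERESIS

        System.out.println(composed.equals(decomposed));            // false
        System.out.println(equalsNormalized(composed, decomposed)); // true
    }
}
```

NFC is used here because keyboards and most editors produce the composed form; NFD would work equally well as long as both sides use the same form.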

Working with encodings in Java suffers from the many methods and constructors that have an overload without an encoding parameter; those overloads default to the operating system's (or an explicitly configured) encoding.

In your case it looks like the latter. Typical culprits are a Scanner without a specified encoding, FileReader/FileWriter, an InputStreamReader without a charset argument, and new String(byte[]).
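
As a rough sketch of the alternative, the example below writes and reads a temporary file and names the charset at every step, so the platform default is never involved. The class name and the temporary file are assumptions made for the demonstration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not from the original post.
final class ExplicitCharsetIo {
    /** Writes one word to a temp file as UTF-8 and reads it back as UTF-8. */
    static String roundTrip(String word) {
        try {
            Path tmp = Files.createTempFile("words", ".txt");
            try {
                // Explicit charset on the writer instead of FileWriter's default
                try (Writer w = new OutputStreamWriter(
                        Files.newOutputStream(tmp), StandardCharsets.UTF_8)) {
                    w.write(word);
                }
                // Explicit charset on the reader instead of FileReader's default
                try (BufferedReader r = new BufferedReader(new InputStreamReader(
                        Files.newInputStream(tmp), StandardCharsets.UTF_8))) {
                    return r.readLine();
                }
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String word = "\u00E5ngstr\u00F6m"; // "ångström"
        System.out.println(word.equals(roundTrip(word))); // true
    }
}
```

The same idea applies to Scanner: pass the charset name as the second constructor argument, as the question's code already does.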
