How can I compare two words which contain Unicode characters?

Question

I am trying to create a language model that processes words. My corpus is in a foreign language, so it contains Unicode characters such as ġ, ħ and ż. However, .equals does not work on words with these letters, even though I am reading the text from a text file and copying such words exactly. What can I do to fix this?

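import java.util.Scanner;

public class test3 {
  public static void main(String[] args) {
    // Scanner decodes System.in with the platform's default charset
    Scanner s = new Scanner(System.in);
    String line;
    System.out.print("Enter string: ");
    line = s.nextLine();
    // Compare the typed line against the expected word
    if (line.equals("aħħar")) {
      System.out.println("Correct");
    } else {
      System.out.println("Incorrect");
    }
  }
}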

I have entered the word 'aħħar' and keep getting "Incorrect".

Please add the actual code where you need help, and update your question as soon as possible :) — Vikrant Kashyap, Nov 07 '17 at 08:19
Equals works just fine on strings containing Unicode characters. You have a problem with how you are reading them. For example, are you using the correct Charset? Are there non-printable characters that you haven't noticed (because they're non-printable)? — Andy Turner, Nov 07 '17 at 08:20
@searlea Given what OP wrote you cannot say that the reference you've given is a duplicate. See what Andy wrote. — laune, Nov 07 '17 at 08:26
Can you post an example of the code that is not working? See https://stackoverflow.com/help/mcve — Viktor Seifert, Nov 07 '17 at 08:30
@laune Sure, you're right. I hadn't considered this could be a charset encoding issue - I saw accents and thought this the most likely answer (I guess that's why stackoverflow defaults to inserting 'Possible' in a flag-for-duplicate comment) — searlea, Nov 07 '17 at 08:32
public class test3 { public static void main(String[] args) { Scanner s = new Scanner(System.in); String line; System.out.print("Enter string: "); line = s.nextLine(); if(line.equals("aħħar")){ System.out.println("Correct"); } else{ System.out.println("Incorrect"); } } } I have entered the word 'aħħar' and keep getting incorrect — A. Cam, Nov 07 '17 at 09:43

score 0 · Accepted Answer · answered Nov 07 '17 at 11:05

The most likely reason is that the charset Scanner uses to decode standard input (the platform's default, unless you specify one) does not match the encoding in which your operating system's console actually sends the bytes.

Note that Scanner has constructors that take an additional String charsetName parameter, which names the encoding used to convert bytes from the source into characters to be scanned. Pass the appropriate value, which may vary between operating systems and installations.
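
As a rough sketch of the idea, the snippet below passes an explicit charset name to the Scanner(InputStream, String) constructor. To keep it self-contained, it simulates console input with a ByteArrayInputStream holding UTF-8 bytes. The class name CharsetScannerDemo and the helper readLine are made up for illustration, and "UTF-8" is only an assumption about what your console sends. The expected word is written with Unicode escapes (ħ is U+0127) so the comparison does not depend on the encoding of the source file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class CharsetScannerDemo {

    // Hypothetical helper: reads one line, decoding the bytes with the named charset.
    static String readLine(InputStream in, String charsetName) {
        Scanner s = new Scanner(in, charsetName);
        return s.hasNextLine() ? s.nextLine() : "";
    }

    public static void main(String[] args) {
        // "aħħar" spelled with escapes, independent of the source file encoding.
        String expected = "a\u0127\u0127ar";

        // Stand-in for a console that sends UTF-8 bytes (an assumption for this demo).
        byte[] utf8Bytes = (expected + "\n").getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset reproduces the word.
        String matching = readLine(new ByteArrayInputStream(utf8Bytes), "UTF-8");
        System.out.println(matching.equals(expected));   // prints true

        // Decoding the same bytes with a mismatched charset garbles the ħ characters.
        String mismatched = readLine(new ByteArrayInputStream(utf8Bytes), "ISO-8859-1");
        System.out.println(mismatched.equals(expected)); // prints false
    }
}
```

In the program from the question, the equivalent change would be something like new Scanner(System.in, "UTF-8"), with the charset name replaced by whatever your console really uses.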
