Java unicode comparison

Question

Possible Duplicates:
Java. Ignore accents when comparing strings
Java string searching ignoring accents

Hi All

I need to compare strings in Java that might look like 'Chloe' and 'Chloé', and I need them to be treated as equal. Does anyone know what the best practice is? Or is there a third-party library for this?

Roman

Actually, they aren't equal, unless the second is the French form of the English one (which means you'd have to translate it and then do the comparison). – Buhake Sindi, Nov 29 '10 at 11:52

score 9 · Answer 1 · answered Nov 29 '10 at 12:10

Have a look at International Components for Unicode (ICU); it can do what you need.

Edit: here's some sample code to get you started (from the Collator Javadoc):

// Get the Collator for US English and set its strength to PRIMARY
Collator usCollator = Collator.getInstance(Locale.US);
usCollator.setStrength(Collator.PRIMARY);
if (usCollator.compare("abc", "ABC") == 0) {
  System.out.println("Strings are equivalent");
}
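
Applied to the strings from the question, the same idea might look like the sketch below. It uses the JDK's own `java.text.Collator` rather than ICU, on the assumption that the JDK class is sufficient for this case; the class name `AccentInsensitiveCompare` and the helper `equalsIgnoringAccents` are made up for illustration. Note that PRIMARY strength ignores case as well as accents.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitiveCompare {

    // Hypothetical helper (not part of the answer): true when the two strings
    // differ only by accents or letter case.
    public static boolean equalsIgnoringAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        // PRIMARY strength compares base letters only, so accent (secondary)
        // and case (tertiary) differences are ignored.
        collator.setStrength(Collator.PRIMARY);
        // Also accept input that stores the accent as a separate combining mark.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "Chlo\u00e9" is Chloé written with a Unicode escape (U+00E9).
        System.out.println(equalsIgnoringAccents("Chloe", "Chlo\u00e9"));
        System.out.println(equalsIgnoringAccents("Chloe", "Chloa"));
    }
}
```

When many strings are compared, it is probably worth creating the Collator once and reusing it instead of building a new one on every call as this sketch does.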

Tassos Bassoukos

This is the only correct answer. – tchrist Mar 05 '11 at 11:33

Lukas Eder · Answer 2 · 2010-11-29T12:00:40.937

We translate the string "Chloé" to "Chloe" before comparison, using hard-coded mappings between special characters and their ASCII equivalents. That works quite well but is clumsy, and we have probably forgotten some special characters.

Our solution looks something like this:

public static String replaceAccents(String string) {
  String result = null;

  if (string != null) {
    result = string;

    result = result.replaceAll("[àáâãåä]", "a");
    result = result.replaceAll("[ç]", "c");
    result = result.replaceAll("[èéêë]", "e");
    result = result.replaceAll("[ìíîï]", "i");
    result = result.replaceAll("[ñ]", "n");
    result = result.replaceAll("[òóôõö]", "o");
    result = result.replaceAll("[ùúûü]", "u");
    result = result.replaceAll("[ÿý]", "y");

    result = result.replaceAll("[ÀÁÂÃÅÄ]", "A");
    result = result.replaceAll("[Ç]", "C");
    result = result.replaceAll("[ÈÉÊË]", "E");
    result = result.replaceAll("[ÌÍÎÏ]", "I");
    result = result.replaceAll("[Ñ]", "N");
    result = result.replaceAll("[ÒÓÔÕÖ]", "O");
    result = result.replaceAll("[ÙÚÛÜ]", "U");
    result = result.replaceAll("[Ý]", "Y");
  }

  return result;
}

So I'm curious about a good answer to this one!

Looks like a possible solution to me, but I am really curious about its performance; I will be comparing a lot of strings in the end. – Roman, Nov 29 '10 at 12:28
This particular example can be replaced by `java.text.Normalizer`. See also [this answer](http://stackoverflow.com/questions/2397804/java-string-searching-ignoring-accents/2397830#2397830). — BalusC, Nov 29 '10 at 12:37
Performance is OK in our case, because it is not invoked a lot of times. — Lukas Eder, Nov 29 '10 at 12:48
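
The `java.text.Normalizer` approach mentioned in the comment above could be sketched roughly as follows. This is a minimal illustration, not the code from the linked answer: the class name `AccentStripper` and the method `stripAccents` are hypothetical, and the regex `\p{M}` (all Unicode combining marks) is one reasonable choice among several. The pattern is compiled once, which may help with the performance concern raised above, though that has not been measured here.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class AccentStripper {

    // Matches Unicode combining marks (category M); compiled once and reused.
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    // Hypothetical helper: returns the input without accents, or null for null.
    public static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        // NFD splits a precomposed letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Dropping the marks leaves only the base letters.
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        // "Chlo\u00e9" is Chloé written with a Unicode escape (U+00E9).
        System.out.println(stripAccents("Chlo\u00e9"));
        System.out.println(stripAccents("Chloe").equals(stripAccents("Chlo\u00e9")));
    }
}
```

Letters that have no canonical decomposition, such as ø or ß, pass through unchanged, so this is not a complete transliteration to ASCII.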

score 0 · Answer 3 · answered Nov 29 '10 at 12:11

What about stripAccents from Apache Commons Lang?

Removes the accents from a string.

NOTE: This is a JDK 1.6 method, it will fail on JDK 1.5.

 StringUtils.stripAccents(null)                = null
 StringUtils.stripAccents("")                  = ""
 StringUtils.stripAccents("control")           = "control"
 StringUtils.stripAccents("&ecute;clair")      = "eclair"


Parameters:
    input - String to be stripped 
Returns:
    String without accents on the text

They don't mention Unicode encoding (and only give an HTML example), but you may want to give it a try anyway.

That's nice. Unfortunately, commons-lang 3.0 has been in beta state forever... Who knows when they will finally release that new version... — Lukas Eder, Nov 29 '10 at 12:55
