
Possible Duplicates:
Java. Ignore accents when comparing strings
Java string searching ignoring accents

Hi All

I need to compare strings in Java that might look like 'Chloe' and 'Chloé', and I need them to be treated as equal. Does anyone know what the best practice is? Or is there a third-party library for this?

Roman

  • Actually, they aren't equal unless the second is the French form of the English one (in which case you would have to translate it before comparing). – Buhake Sindi Nov 29 '10 at 11:52

3 Answers


Have a look at International Components for Unicode (ICU); it can do what you need.

Edit: here's some sample code to get you started (from the Collator Javadoc):

// Get the Collator for US English and set its strength to PRIMARY
Collator usCollator = Collator.getInstance(Locale.US);
usCollator.setStrength(Collator.PRIMARY);
if (usCollator.compare("abc", "ABC") == 0) {
  System.out.println("Strings are equivalent");
}
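
To tie this back to the strings in the question, here is a minimal sketch that assumes the JDK's built-in `java.text.Collator` rather than ICU (ICU's collator is used in much the same way). The class name `AccentInsensitiveCompare` and the method `equalsIgnoringAccents` are made up for illustration. Note that PRIMARY strength ignores case as well as accents.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative, not part of any library.
class AccentInsensitiveCompare {

    // Creating a Collator per call keeps the sketch simple;
    // reuse one instance when comparing many strings.
    static boolean equalsIgnoringAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        // PRIMARY strength compares base letters only (accents and case are ignored).
        collator.setStrength(Collator.PRIMARY);
        // Decompose characters so precomposed and combining-mark forms are handled alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // The accented name is written with a Unicode escape to avoid source-encoding issues.
        System.out.println(equalsIgnoringAccents("Chloe", "Chlo\u00e9"));
        System.out.println(equalsIgnoringAccents("Chloe", "Chris"));
    }
}
```

If case should still matter, this approach is probably not the right fit, since PRIMARY treats "chloe" and "Chloé" as equal too.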
Tassos Bassoukos

We translate the string "Chloé" to "Chloe" before comparison, using hard-coded mappings between special characters and their equivalent ASCII characters. That works quite well, but it is clumsy, and there are probably some special characters we have forgotten.

Our solution looks something like this:

public static String replaceAccents(String string) {
  String result = null;

  if (string != null) {
    result = string;

    result = result.replaceAll("[àáâãåä]", "a");
    result = result.replaceAll("[ç]", "c");
    result = result.replaceAll("[èéêë]", "e");
    result = result.replaceAll("[ìíîï]", "i");
    result = result.replaceAll("[ñ]", "n");
    result = result.replaceAll("[òóôõö]", "o");
    result = result.replaceAll("[ùúûü]", "u");
    result = result.replaceAll("[ÿý]", "y");

    result = result.replaceAll("[ÀÁÂÃÅÄ]", "A");
    result = result.replaceAll("[Ç]", "C");
    result = result.replaceAll("[ÈÉÊË]", "E");
    result = result.replaceAll("[ÌÍÎÏ]", "I");
    result = result.replaceAll("[Ñ]", "N");
    result = result.replaceAll("[ÒÓÔÕÖ]", "O");
    result = result.replaceAll("[ÙÚÛÜ]", "U");
    result = result.replaceAll("[Ý]", "Y");
  }

  return result;
}

So I'm curious about a good answer to this one!

Lukas Eder
  • Looks like a possible solution to me, but I am really curious about the performance of this one, as I will be comparing a lot of strings in the end. – Roman Nov 29 '10 at 12:28
  • This particular example can be replaced by `java.text.Normalizer`. See also [this answer](http://stackoverflow.com/questions/2397804/java-string-searching-ignoring-accents/2397830#2397830). – BalusC Nov 29 '10 at 12:37
  • Performance is OK in our case, because it is not invoked a lot of times. – Lukas Eder Nov 29 '10 at 12:48
  • Lukas, the comment above leads to a very elegant answer! – Roman Nov 29 '10 at 12:48
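
The `java.text.Normalizer` approach mentioned in the comments can be sketched roughly as follows. This is only an illustration, not the code from the linked answer: the class name `AccentStripper` and its methods are hypothetical. It decomposes each character (NFD) and then deletes the combining marks, with the regex compiled once because many strings are to be compared.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class AccentStripper {

    // \p{M} matches Unicode combining marks; the pattern is compiled once and reused.
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Dropping the marks leaves only the base letters.
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    static boolean equalsIgnoringAccents(String a, String b) {
        if (a == null || b == null) {
            return a == null && b == null;
        }
        return stripAccents(a).equals(stripAccents(b));
    }
}
```

Unlike a collator at PRIMARY strength, this stays case-sensitive, and letters that have no canonical decomposition (ø or ß, for instance) are left unchanged.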

What about StringUtils.stripAccents from Apache Commons Lang?

Removes the accents from a string.

NOTE: This is a JDK 1.6 method, it will fail on JDK 1.5.

 StringUtils.stripAccents(null)                = null
 StringUtils.stripAccents("")                  = ""
 StringUtils.stripAccents("control")           = "control"
 StringUtils.stripAccents("&ecute;clair")      = "eclair"


Parameters:
    input - String to be stripped 
Returns:
    String without accents on the text

They don't mention Unicode encoding (and give only an HTML example), but you may want to give it a try anyway.

Kevin
  • That's nice. Unfortunately, commons-lang 3.0 has been in beta state forever... Who knows when they will finally release that new version... – Lukas Eder Nov 29 '10 at 12:55