58

The problem is simple. Is there any function in Java that compares two Strings and returns true, ignoring accented characters?

For example:

String x = "Joao";
String y = "João";

should be reported as equal.

Thanks

framara
  • 9
    but those are __NOT__ equal, why would you want them to be equal when they are not? –  Mar 03 '10 at 16:55
  • 6
    @fuzzy both are usually the same name (it's the Portuguese version of John). Some people are just too lazy to include accents – Samuel Carrijo Mar 03 '10 at 16:57
  • 1
    In Spanish, n and ñ are considered different letters. – Nicolás Mar 03 '10 at 17:01
  • 8
    Yeah, but by his example, it seems he wants to compare names and is not too worried about false positives – Samuel Carrijo Mar 03 '10 at 17:03
  • I'm Spanish, and treating 'á' and 'a' as the same letter is not the same thing as treating 'n' and 'ñ' as the same letter (although for names, which is what I need, they may well be treated alike) – framara Mar 03 '10 at 17:06
  • @Framara: Please reconsider the accepted answer. – Lawrence Dol Mar 03 '10 at 21:51
  • Either way, they aren't the same character to the computer: they are 2 different Unicode characters, so they are by definition __NOT__ equal. You will have to roll your own comparator to get the incorrect behavior you are looking for. What you should be looking at is something like Metaphone. –  Mar 04 '10 at 15:45
  • 1
    They are not equal, but assumed equality is useful when comparing for sorting, or in filenames, where UTF-8 characters are not well-supported (e.g. in zip files...) – Lukas Eder Nov 29 '10 at 11:59
  • 10
    this can be extremely useful for searching. Users are too lazy to properly type accents on a qwerty keyboard. Maybe the question should be rephrased to determining whether two strings are **similar** instead of equal though. – Marijn van Vliet Jan 19 '12 at 10:16
  • In Spanish, n and ñ are considered different letters. Sorts between n and o. There's even a separate keyboard key. As far as I know, in German, "ö" should be considered equal to "oe", not "o". How are you going to handle all that? :) – Nicolás Mar 03 '10 at 17:04
  • 1
    This is a very valid question, especially for systems that need to compare international data. 1- Probably very few systems in the world handle anything multilingual properly. Case in point: the threads below mention that even Java has buggy Unicode support. 2- When you have services that accept data from third parties, it all goes down the tubes, since no one ever handles the data consistently. 3- As mentioned before, people just don't type data in properly at all, whether out of laziness, typos, etc. 4- Joao may as well be a Spanish user unfortunately using an English computer. – user432024 Aug 16 '13 at 19:25
  • @JarrodRoberson This is about functional equality, and functional equality means contextual equality. Something can be functionally equal to many different objects depending on its use case, its context. It might not even be about code points at all, but instead about the number of pixels used to draw the symbols. There is no single definition of equality. Making words bold doesn't change that fact. – Benny Bottema Oct 19 '16 at 13:05
  • 1
    People here need to differentiate between technically equal and functionally equal. Technically these are not the same obviously. Functionally depending on your domain, use case and subsequent business logic, "Joao" can be equal to "João", "Jo" or even both "bob" and "1234" at the same time. In case the comparison stops at primary characters (unaccented base characters), a PRIMARY strength Collator fits the job perfectly. – Benny Bottema Apr 17 '20 at 08:04

6 Answers

69

I think you should be using the Collator class. It allows you to set a strength and locale and it will compare characters appropriately.

From the Java 1.6 API:

You can set a Collator's strength property to determine the level of difference considered significant in comparisons. Four strengths are provided: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. The exact assignment of strengths to language features is locale dependant. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical.

I think the important point here (which people are trying to make) is that "Joao" and "João" should never be considered equal, but if you are sorting you don't want them compared by their raw character values, because then you would get something like Joao, John, João, which is not good. Using the Collator class handles this correctly.
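To make the sorting point concrete, here is a minimal sketch that sorts the three names with a Collator. The class name CollatorSortSketch, the helper sortNames and the choice of Locale.ENGLISH are only illustrative assumptions; a real application would normally pass the user's locale.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
final class CollatorSortSketch {

    // Returns a new list sorted by the collation rules of the given locale.
    static List<String> sortNames(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        // "Jo\u00e3o" is "João", written with an escape to avoid source-encoding issues.
        List<String> names = Arrays.asList("John", "Jo\u00e3o", "Joao");

        // Natural String ordering compares char values, so the accented name sorts last.
        List<String> byCharValue = new ArrayList<>(names);
        byCharValue.sort(null);
        System.out.println(byCharValue);

        // Collation keeps the accented and unaccented spellings next to each other.
        System.out.println(sortNames(names, Locale.ENGLISH));
    }
}
```

The strength is left at its default here, so the accent still acts as a tie-breaker between "Joao" and "João"; it just no longer pushes "João" past "John".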

Goldorak84
DaveJohnston
  • 3
    @Software Monkey: I agree too, even though I wrote the accepted answer. :-P – C. K. Young Mar 03 '10 at 19:39
  • 1
    FYI folks, created a bit of code [here](https://code.google.com/p/jjcommon/source/browse/trunk/src/main/java/com/jjcommon/JJStringUtils.java?spec=svn11&r=11#82) that follows your guidelines, so thanks for that. However I didn't see a way to do a comparison that is ACCENT insensitive, but CASE sensitive, following the Collator's rules... did I miss something? – Joao Coelho Feb 03 '12 at 19:54
  • 1
    @Joao you won't be able to do that with the Collator class because the strength is set as the minimum level. So to get case sensitivity you need TERTIARY, but for accent insensitivity you only want PRIMARY, so they won't work together. You might be better off using Chris Jester-Young's solution to filter off the accent characters and then compare the strings normally. – DaveJohnston Feb 06 '12 at 10:30
26

You didn't hear this from me (because I disagree with the premise of the question), but, you can use java.text.Normalizer, and normalize with NFD: this splits off the accent from the letter it's attached to. You can then filter off the accent characters and compare.
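A minimal sketch of that approach, assuming that "filtering off the accents" means deleting every character in the Unicode Combining Diacritical Marks block after NFD decomposition. The class name AccentStripper and both method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of the JDK.
final class AccentStripper {

    // NFD splits "ã" into "a" plus a combining tilde; the regex then drops the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Accent-insensitive but still case-sensitive, because String.equals is used.
    static boolean equalsIgnoringAccents(String a, String b) {
        return stripAccents(a).equals(stripAccents(b));
    }

    public static void main(String[] args) {
        // "Jo\u00e3o" is "João", written with an escape to avoid source-encoding issues.
        System.out.println(equalsIgnoringAccents("Joao", "Jo\u00e3o"));
        System.out.println(equalsIgnoringAccents("joao", "Jo\u00e3o"));
    }
}
```

Letters that have no canonical decomposition (for example "ø" or "ß") pass through unchanged, so this only covers accents that Unicode models as base letter plus combining mark.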

C. K. Young
  • 6
    The two steps are combined into one by StringUtils.stripAccents http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringUtils.html – cquezel May 31 '13 at 19:18
10

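Or use stripAccents from the Apache Commons Lang StringUtils class if you want to compare/sort ignoring accents:

import org.apache.commons.lang3.StringUtils;

// Compares two strings after removing their diacritics (Apache Commons Lang 3).
public int compareStripAccent(String a, String b) {
    return StringUtils.stripAccents(a).compareTo(StringUtils.stripAccents(b));
}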
Daniel
9

Java's Collator returns 0 when comparing "a" and "á", if you configure it to ignore diacritics:

import java.text.Collator;

// PRIMARY strength ignores accents (and case); uses the default locale's rules.
public boolean isSame(String a, String b) {
    Collator insensitiveStringComparator = Collator.getInstance();
    insensitiveStringComparator.setStrength(Collator.PRIMARY);
    return insensitiveStringComparator.compare(a, b) == 0;
}

isSame("a", "á") yields true
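Since Collator.getInstance() with no argument uses the JVM's default locale, the result can vary between machines. Below is a hedged variant that takes the locale explicitly; the class name LocaleAwareComparison and the extra Locale parameter are assumptions added for illustration, not part of the answer above.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK.
final class LocaleAwareComparison {

    // Compares at PRIMARY strength using the collation rules of the given locale.
    static boolean isSame(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Locale spanish = Locale.forLanguageTag("es");

        // "\u00e1" is "á" and "\u00f1" is "ñ".
        System.out.println(isSame("a", "\u00e1", Locale.ENGLISH));
        System.out.println(isSame("n", "\u00f1", Locale.ENGLISH));

        // What counts as a primary difference is locale dependent, so the
        // Spanish rules may treat "n" and "ñ" differently from the English ones.
        System.out.println(isSame("n", "\u00f1", spanish));
    }
}
```

Note that PRIMARY strength also ignores case, so isSame("joao", "JOÃO", Locale.ENGLISH) matches as well.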

Benny Bottema
2
public boolean insensitiveStringComparator(String a, String b) {
    java.text.Collator collate = java.text.Collator.getInstance();
    collate.setStrength(java.text.Collator.PRIMARY);
    // Decompose precomposed characters (e.g. "ã") before comparing.
    collate.setDecomposition(java.text.Collator.CANONICAL_DECOMPOSITION);
    return collate.equals(a, b);
}
-2

The problem with this sort of conversion is that there isn't always a clear-cut mapping from accented to non-accented characters. It depends on code pages, localizations, etc. For example, is an "a" with an accent equivalent to a plain "a"? Not a problem for a human, but trickier for the computer.

AFAIK Java does not have a built-in conversion that can look up the current localization options and make this sort of conversion. You may need some external library that handles Unicode better, like ICU (http://site.icu-project.org/ )

Uri
  • Java does have it, it's called the [Collator](http://docs.oracle.com/javase/tutorial/i18n/text/locale.html) and is specifically made for this kind of problem. – Benny Bottema Oct 19 '16 at 13:21