37

I am trying to write a filter function for my application that will take an input string and filter out all objects that don't match the given input in some way. The easiest way to do this would be to use String's contains method, i.e. just check if the object (the String variable in the object) contains the string specified in the filter, but this won't account for accents.

The objects in question are basically Persons, and the strings I am trying to match are names. So, for example, if someone searches for Joao I would expect Joáo to be included in the result set. I have already used the Collator class in my application to sort by name, and it works well because it can compare: with the UK Locale, á comes before b but after a. But obviously it doesn't return 0 if you compare a and á, because they are not equal.

So does anyone have any idea how I might be able to do this?

DaveJohnston
  • Possible duplicate of [Java. Ignore accents when comparing strings](http://stackoverflow.com/questions/2373213/java-ignore-accents-when-comparing-strings) – Barett Nov 01 '16 at 18:22

3 Answers

98

Make use of java.text.Normalizer and a shot of regex to get rid of the diacritics.

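import java.text.Normalizer;
import java.text.Normalizer.Form;

// Decompose to NFD, then strip the combining marks that carried the accents.
public static String removeDiacriticalMarks(String string) {
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}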

Which you can use as follows:

String value = "Joáo";
String comparisonMaterial = removeDiacriticalMarks(value); // Joao
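
To connect this back to the filter in the question, here is a minimal sketch of an accent-insensitive "contains" filter built on the same normalization. The class name AccentInsensitiveFilter and its methods fold and filter are hypothetical names introduced for illustration, and the lower-casing step is an added assumption (the question only asks about accents):

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the original answer.
class AccentInsensitiveFilter {

    // Decompose to NFD, drop combining marks, then lower-case.
    // The lower-casing is an assumption: it makes the match ignore case too.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .toLowerCase(Locale.ROOT);
    }

    // Keep only the names whose folded form contains the folded query.
    static List<String> filter(List<String> names, String query) {
        String needle = fold(query); // fold the query once, not per name
        List<String> matches = new ArrayList<>();
        for (String name : names) {
            if (fold(name).contains(needle)) {
                matches.add(name);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // \u00e1 is the precomposed "a with acute", written as an escape
        // so the example does not depend on the source file encoding.
        List<String> names = Arrays.asList("Jo\u00e1o Silva", "John Smith", "JOAO Pereira");
        // Prints the first and third names; "John Smith" is filtered out.
        System.out.println(filter(names, "Joao"));
    }
}
```

If the same list is searched repeatedly, it may be worth storing the folded name alongside each Person once, so that only the query is normalized per search.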
BalusC
  • 2
    I withdraw my answer! Never come across java.text.Normalizer, thanks for the tip – brabster Mar 07 '10 at 20:31
  • This is great. I was trying to do regex matches on non-ascii strings albeit unsuccessfully. The normalize seems to be the best way to do it. – ankimal Jun 23 '10 at 22:57
  • 1
    This is a poor answer. You need to use the [ICU Collator class](http://icu-project.org/apiref/icu4c/classCollator.html) to create a collator object with comparison strength set to PRIMARY. [This answer](http://stackoverflow.com/questions/5157141/how-do-you-match-accented-and-tilde-characters-in-a-perl-regular-expression-rege/5163247#5163247) shows how to do that from a Perl point of view. – tchrist Mar 05 '11 at 11:35
  • Great. Exactly what I was looking for. Thxs! – Pablo Alba Apr 26 '11 at 10:52
  • 1
    http://stackoverflow.com/questions/10812051/java-string-searching-ignoring-accents-part-ii – mark May 30 '12 at 07:44
  • http://stackoverflow.com/questions/2373213/java-ignore-accents-when-comparing-strings – Benny Bottema Oct 19 '16 at 13:18
  • 1
    Collator cannot be used to search within a string, only to compare complete strings, so it doesn't work for a search (except for an exact match!). Normalizer works well but is slow: good for a single value, but not for searching among a big set of values. – RiRomain Jan 26 '18 at 11:04
4

Collator does return 0 for a and á, if you configure it to ignore diacritics:

import java.text.Collator;

public boolean isSame(String a, String b) {
    Collator insensitiveStringComparator = Collator.getInstance();
    // For most locales, PRIMARY strength ignores both accent and case differences.
    // SECONDARY would still ignore case but treat a and á as different.
    insensitiveStringComparator.setStrength(Collator.PRIMARY);
    return insensitiveStringComparator.compare(a, b) == 0;
}

isSame("a", "á") yields true now
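
As a rough illustration of how the strength setting changes the result, the sketch below compares the same pairs at each strength. The class name CollatorStrengthDemo and its helper methods are hypothetical, and the choices of Locale.UK (the locale mentioned in the question) and canonical decomposition are assumptions rather than part of the answer above:

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class for illustration; not part of the original answer.
class CollatorStrengthDemo {

    // Build a collator for the given locale and strength. Canonical
    // decomposition is switched on (an assumption of this sketch) so that
    // precomposed and decomposed accented forms are treated alike.
    static Collator collator(Locale locale, int strength) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    // True when the two strings compare as equal at the given strength.
    static boolean same(String a, String b, int strength) {
        return collator(Locale.UK, strength).compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String aAcute = "\u00e1"; // precomposed "a with acute"
        System.out.println(same("a", aAcute, Collator.PRIMARY));   // true: accents ignored
        System.out.println(same("a", aAcute, Collator.SECONDARY)); // false: accents matter
        System.out.println(same("a", "A", Collator.SECONDARY));    // true: case ignored
        System.out.println(same("a", "A", Collator.TERTIARY));     // false: case matters
    }
}
```

Note that this only compares whole strings; it does not by itself give a substring ("contains") match.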

Benny Bottema
0

I have written a class for searching through Arabic texts by ignoring diacritics (NOT removing them). Maybe you can get the idea or use it in some way.

DiacriticInsensitiveSearch.java

mehdok