44

Is there any way to remove accents from a String on Android, which (to my knowledge) doesn't have java.text.Normalizer? E.g. "éàù" becomes "eau".

I'd like to avoid parsing the String to check each character if possible!

Johann
  • 2
    Android has [java.text.Normalizer](http://developer.android.com/reference/java/text/Normalizer.html) starting from API level 9 if you're using that (or later). – eldarerathis Dec 15 '11 at 16:51
  • 2
    Possible duplicate of http://stackoverflow.com/questions/6328654/android-2-3-and-java-text-normalizer – cyborg Dec 15 '11 at 16:51
  • 1
    If you are sorting or matching, take a look at [`Collator`](http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html); it is better than stripping accents yourself unless you need to display the result. – erickson Dec 15 '11 at 18:05
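
For the sorting and matching case mentioned in the comment above, a minimal sketch of an accent-insensitive comparison with `Collator` might look like this. The class and method names are hypothetical, and `Locale.ROOT` is only a placeholder: a real app would normally pass the user's locale, and some locales treat certain accented letters as distinct letters.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class AccentInsensitiveCompare {

    // Builds a collator that ignores accent and case differences.
    // Locale.ROOT is an assumption made for this sketch.
    static Collator primaryCollator() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // True when the two strings differ only by accents or case.
    static boolean sameLetters(String a, String b) {
        return primaryCollator().compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "\u00e9\u00e0\u00f9" is e-acute, a-grave, u-grave.
        System.out.println(sameLetters("\u00e9\u00e0\u00f9", "eau"));

        // A Collator is also a Comparator, so it can drive a sort directly.
        String[] words = {"zone", "\u00e9t\u00e9", "etage"};
        Arrays.sort(words, primaryCollator());
        System.out.println(Arrays.toString(words));
    }
}
```

The original strings are left untouched, which is why this approach suits comparison and sorting but not cases where the accent-free text itself is needed.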

5 Answers

99

java.text.Normalizer is available on Android from API level 9 onward (see the comments below), so you can use it there.

EDIT: For reference, here is how to use Normalizer:

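// NFD splits each accented letter into a base letter plus combining marks.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// Caution: this drops every non-ASCII character, not only the marks.
string = string.replaceAll("[^\\p{ASCII}]", "");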

(pasted from the link in comments below)

Guillaume
  • 2
    I'm trying to code with API level 7 for compatibility with older devices and I don't think it's there – Johann Dec 15 '11 at 16:57
  • Could you post an example on how to call `Normalizer.normalize` to remove accents? – Mister Smith Dec 15 '11 at 17:17
  • @Johann Added in API-9 http://developer.android.com/reference/java/text/Normalizer.html – shkschneider Oct 22 '14 at 09:51
  • 2
    @Guillaume, hi, I'm from Poland and it's not working with "Ł" :s – Piszu Apr 25 '15 at 18:27
  • 1
    Well, this does strip everything that is non-Latin from the string, not just accents. – ntninja Aug 15 '15 at 23:58
  • 2
    `"[^\\p{ASCII}]"` will remove all non-ascii characters. You can use `"[\\p{M}]"` regexp instead to remove only the accents after decomposition. Source: [Remove Accents and Diacritics from String](https://memorynotfound.com/remove-accents-diacritics-from-string/). – Afilu Mar 30 '21 at 08:45
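
The `\p{M}` variant suggested in the last comment can be sketched as follows. This assumes `java.text.Normalizer` is available (API level 9 or later, per the comments above); the class and method names are made up for illustration.

```java
import java.text.Normalizer;

public class AccentStripper {

    // Decomposes the input (NFD) and removes only the combining marks,
    // so base characters survive, including non-Latin letters.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // e-acute, a-grave, u-grave
        System.out.println(stripAccents("\u00e9\u00e0\u00f9"));
        // A Polish word starting with L-with-stroke (U+0141): that letter
        // has no canonical decomposition, so it is left unchanged.
        System.out.println(stripAccents("\u0141\u00f3d\u017a"));
    }
}
```

This also explains the "Ł" report above: that letter is not a base letter plus a combining mark, so NFD leaves it intact. The original ASCII-only regex then deletes it, while the `\p{M}` regex keeps it as is.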
8

I've adjusted Rabi's solution to my needs; I hope it helps someone:

private static Map<Character, Character> MAP_NORM;
public static String removeAccents(String value)
{
    if (MAP_NORM == null || MAP_NORM.size() == 0)
    {
        MAP_NORM = new HashMap<Character, Character>();
        MAP_NORM.put('À', 'A');
        MAP_NORM.put('Á', 'A');
        MAP_NORM.put('Â', 'A');
        MAP_NORM.put('Ã', 'A');
        MAP_NORM.put('Ä', 'A');
        MAP_NORM.put('È', 'E');
        MAP_NORM.put('É', 'E');
        MAP_NORM.put('Ê', 'E');
        MAP_NORM.put('Ë', 'E');
        MAP_NORM.put('Í', 'I');
        MAP_NORM.put('Ì', 'I');
        MAP_NORM.put('Î', 'I');
        MAP_NORM.put('Ï', 'I');
        MAP_NORM.put('Ù', 'U');
        MAP_NORM.put('Ú', 'U');
        MAP_NORM.put('Û', 'U');
        MAP_NORM.put('Ü', 'U');
        MAP_NORM.put('Ò', 'O');
        MAP_NORM.put('Ó', 'O');
        MAP_NORM.put('Ô', 'O');
        MAP_NORM.put('Õ', 'O');
        MAP_NORM.put('Ö', 'O');
        MAP_NORM.put('Ñ', 'N');
        MAP_NORM.put('Ç', 'C');
        MAP_NORM.put('ª', 'A');
        MAP_NORM.put('º', 'O');
        MAP_NORM.put('§', 'S');
        MAP_NORM.put('³', '3');
        MAP_NORM.put('²', '2');
        MAP_NORM.put('¹', '1');
        MAP_NORM.put('à', 'a');
        MAP_NORM.put('á', 'a');
        MAP_NORM.put('â', 'a');
        MAP_NORM.put('ã', 'a');
        MAP_NORM.put('ä', 'a');
        MAP_NORM.put('è', 'e');
        MAP_NORM.put('é', 'e');
        MAP_NORM.put('ê', 'e');
        MAP_NORM.put('ë', 'e');
        MAP_NORM.put('í', 'i');
        MAP_NORM.put('ì', 'i');
        MAP_NORM.put('î', 'i');
        MAP_NORM.put('ï', 'i');
        MAP_NORM.put('ù', 'u');
        MAP_NORM.put('ú', 'u');
        MAP_NORM.put('û', 'u');
        MAP_NORM.put('ü', 'u');
        MAP_NORM.put('ò', 'o');
        MAP_NORM.put('ó', 'o');
        MAP_NORM.put('ô', 'o');
        MAP_NORM.put('õ', 'o');
        MAP_NORM.put('ö', 'o');
        MAP_NORM.put('ñ', 'n');
        MAP_NORM.put('ç', 'c');
    }

    if (value == null) {
        return "";
    }

    StringBuilder sb = new StringBuilder(value);

    for(int i = 0; i < value.length(); i++) {
        Character c = MAP_NORM.get(sb.charAt(i));
        if(c != null) {
            sb.setCharAt(i, c.charValue());
        }
    }

    return sb.toString();

}
Juarez Schulz
  • 1
    Here are more conversions for Polish diacritics: MAP_NORM.put('Ą', 'A'); MAP_NORM.put('Ę', 'E'); MAP_NORM.put('Ć', 'C'); MAP_NORM.put('Ł', 'L'); MAP_NORM.put('Ń', 'N'); MAP_NORM.put('Ś', 'S'); MAP_NORM.put('Ź', 'Z'); MAP_NORM.put('Ż', 'Z'); MAP_NORM.put('ą', 'a'); MAP_NORM.put('ę', 'e'); MAP_NORM.put('ç', 'c'); MAP_NORM.put('ć', 'c'); MAP_NORM.put('ł', 'l'); MAP_NORM.put('ń', 'n'); MAP_NORM.put('ś', 's'); MAP_NORM.put('ź', 'z'); MAP_NORM.put('ż', 'z'); – Michael Osofsky Jan 19 '21 at 23:01
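
Where `java.text.Normalizer` is available (API level 9 or later), the lookup table can be kept much smaller by letting NFD handle every letter that decomposes and mapping only the few that do not. The following is a rough sketch under that assumption; the class name, the method name and the short fallback list are illustrative, not exhaustive.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class AccentFolder {

    // Letters with no canonical decomposition. This list is an
    // illustrative assumption; extend it for the alphabets you need.
    private static final Map<Character, Character> FALLBACK =
            new HashMap<Character, Character>();
    static {
        FALLBACK.put('\u0141', 'L'); // L with stroke
        FALLBACK.put('\u0142', 'l');
        FALLBACK.put('\u00d8', 'O'); // O with stroke
        FALLBACK.put('\u00f8', 'o');
        FALLBACK.put('\u0110', 'D'); // D with stroke
        FALLBACK.put('\u0111', 'd');
    }

    static String fold(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: decompose and drop the combining marks.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Step 2: replace the letters that NFD could not split.
        StringBuilder sb = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            Character mapped = FALLBACK.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }
}
```

Of the Polish letters listed in the comment above, only Ł and ł need a fallback entry; the others decompose into a base letter plus a combining mark and are handled by step 1.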
5

This is probably not the most efficient solution, but it does the trick and works on all Android versions:

private static Map<Character, Character> MAP_NORM;
static { // Greek characters normalization
    MAP_NORM = new HashMap<Character, Character>();
    MAP_NORM.put('ά', 'α');
    MAP_NORM.put('έ', 'ε');
    MAP_NORM.put('ί', 'ι');
    MAP_NORM.put('ό', 'ο');
    MAP_NORM.put('ύ', 'υ');
    MAP_NORM.put('ή', 'η');
    MAP_NORM.put('ς', 'σ');
    MAP_NORM.put('ώ', 'ω');
    MAP_NORM.put('Ά', 'α');
    MAP_NORM.put('Έ', 'ε');
    MAP_NORM.put('Ί', 'ι');
    MAP_NORM.put('Ό', 'ο');
    MAP_NORM.put('Ύ', 'υ');
    MAP_NORM.put('Ή', 'η');
    MAP_NORM.put('Ώ', 'ω');
}

public static String removeAccents(String s) {
    if (s == null) {
        return null;
    }
    StringBuilder sb = new StringBuilder(s);

    for(int i = 0; i < s.length(); i++) {
        Character c = MAP_NORM.get(sb.charAt(i));
        if(c != null) {
            sb.setCharAt(i, c.charValue());
        }
    }

    return sb.toString();
}
Rabi
3

While Guillaume's answer does work, it strips all non-ASCII characters from the string. If you wish to preserve them, try this code (where string is the string to simplify):

// Convert the input string to decomposed Unicode (NFKD) so that the
// diacritical marks used in many European scripts (such as the
// circumflex in "ĉ", LATIN SMALL LETTER C WITH CIRCUMFLEX) become
// separate characters. The compatibility decomposition (the K) also
// maps characters that mean the same as one or more other characters
// (such as "㎏" → "kg" or half-width "ﾋ" → "ヒ") to a common form,
// so that they match when compared.
string = Normalizer.normalize(string, Normalizer.Form.NFKD);

StringBuilder result = new StringBuilder();

int offset = 0, strLen = string.length();
while(offset < strLen) {
    int character = string.codePointAt(offset);
    offset += Character.charCount(character);

    // Only process characters that are not combining Unicode
    // characters. This way all the decomposed diacritical marks
    // (and some other not-that-important modifiers), that were
    // part of the original string or produced by the NFKD
    // normalizer above, disappear.
    switch(Character.getType(character)) {
        case Character.NON_SPACING_MARK:
        case Character.COMBINING_SPACING_MARK:
            // Some combining character found
        break;

        default:
            result.appendCodePoint(Character.toLowerCase(character));
    }
}

// Since we stripped all combining Unicode characters in the
// previous while-loop there should be no combining character
// remaining in the string and the composed and decomposed
// versions of the string should be equivalent. This also means
// we do not need to convert the string back to composed Unicode
// before returning it.
return result.toString();
ntninja
  • Of course it does. If you want to keep Upper-Cases use `result.appendCodePoint(character);` instead of `result.appendCodePoint(Character.toLowerCase(character));`. ;-) – ntninja Sep 10 '15 at 17:35
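
For convenience, the snippet above can be wrapped in a method, with the case handling from the comment turned into a parameter. This is only a sketch: the class name, method name and `lowerCase` flag are hypothetical, and the logic otherwise follows the answer's loop.

```java
import java.text.Normalizer;

public class MarkRemover {

    // Applies NFKD, skips combining marks, and optionally lower-cases
    // the remaining code points.
    static String removeMarks(String input, boolean lowerCase) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder result = new StringBuilder(decomposed.length());

        int offset = 0;
        while (offset < decomposed.length()) {
            int codePoint = decomposed.codePointAt(offset);
            offset += Character.charCount(codePoint);

            switch (Character.getType(codePoint)) {
                case Character.NON_SPACING_MARK:
                case Character.COMBINING_SPACING_MARK:
                    // Combining character: drop it.
                    break;
                default:
                    result.appendCodePoint(
                            lowerCase ? Character.toLowerCase(codePoint) : codePoint);
                    break;
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // U+0108 is LATIN CAPITAL LETTER C WITH CIRCUMFLEX.
        System.out.println(removeMarks("\u0108a", false));
        // U+338F is the single-character "kg" square symbol.
        System.out.println(removeMarks("\u338f", false));
    }
}
```

Because NFKD is a compatibility decomposition, it changes more than accents (for example, it expands the square "kg" symbol into two letters), so `Normalizer.Form.NFD` may be the better choice when only diacritics should go.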
0

Accented characters all lie outside 7-bit ASCII, so their character codes are greater than 127. You could therefore walk through the characters of a string and, whenever a code is greater than 127, map it back to your desired equivalent. There is no easy way to map accented characters back to their non-accented counterparts: you would have to keep some sort of map in memory from those codes to the unaccented characters.

Mike Marshall