Remove accents from String

Question

Is there any way in Android, which (to my knowledge) doesn't have java.text.Normalizer, to remove the accents from a String? E.g. "éàù" becomes "eau".

I'd like to avoid parsing the String to check each character if possible!

Android has [java.text.Normalizer](http://developer.android.com/reference/java/text/Normalizer.html) starting from API level 9 if you're using that (or later). — eldarerathis, Dec 15 '11 at 16:51
Possible duplicate of http://stackoverflow.com/questions/6328654/android-2-3-and-java-text-normalizer — cyborg, Dec 15 '11 at 16:51
If you are sorting or matching, take a look at [`Collator`](http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html); it is better than stripping accents yourself unless you need to display the result. — erickson, Dec 15 '11 at 18:05
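To illustrate the `Collator` suggestion, here is a minimal sketch of accent-insensitive matching. The class name `AccentInsensitiveMatch` and the helper `sameLetters` are made up for this example, and `Locale.ROOT` is an assumption; use the locale your text is actually in. Note that at `PRIMARY` strength the collator ignores case as well as accents. Non-ASCII literals are written as `\u` escapes so the source compiles regardless of file encoding.

```java
import java.text.Collator;
import java.util.Locale;

final class AccentInsensitiveMatch {
    // Hypothetical helper: true when both strings consist of the same base letters.
    static boolean sameLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // PRIMARY strength compares base letters only, so accent and case
        // differences are ignored.
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters so an accented letter is treated
        // as base letter + combining accent.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "eau" written with accents (e-acute, a-grave, u-grave) vs plain "eau"
        System.out.println(sameLetters("\u00e9\u00e0\u00f9", "eau"));
        System.out.println(sameLetters("eau", "eaux"));
    }
}
```

Nothing is stripped here: the original strings stay intact and only the comparison ignores the accents, which is why this suits sorting and matching but not display.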

Guillaume · Accepted Answer · 2011-12-15T18:01:36.980

99

java.text.Normalizer is available in Android (in recent versions, anyway), so you can use it.

EDIT For reference, here is how to use Normalizer:

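// Requires: import java.text.Normalizer;
// NFD splits each accented letter into base letter + combining mark; the regex then drops every non-ASCII character.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
string = string.replaceAll("[^\\p{ASCII}]", "");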

(pasted from the link in comments below)

edited Dec 15 '11 at 18:01

answered Dec 15 '11 at 16:54

Guillaume


I'm trying to code with API level 7 for compatibility with older devices and I don't think it's there – Johann Dec 15 '11 at 16:57
Could you post an example on how to call `Normalizer.normalize` to remove accents? – Mister Smith Dec 15 '11 at 17:17
@Johann Added in API-9 http://developer.android.com/reference/java/text/Normalizer.html – shkschneider Oct 22 '14 at 09:51
@Guillaume, hi, I'm from Poland and it's not working with "Ł" :s – Piszu Apr 25 '15 at 18:27
Well, this does strip everything that is non-Latin from the string, not just accents. – ntninja Aug 15 '15 at 23:58
`"[^\\p{ASCII}]"` will remove all non-ascii characters. You can use `"[\\p{M}]"` regexp instead to remove only the accents after decomposition. Source: [Remove Accents and Diacritics from String](https://memorynotfound.com/remove-accents-diacritics-from-string/). – Afilu Mar 30 '21 at 08:45
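The difference between the two patterns discussed in these comments can be seen in a small sketch. The class and method names (`AccentStripper`, `stripNonAscii`, `stripMarks`) are invented for illustration, and non-ASCII literals are written as `\u` escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Decomposes to NFD, then removes everything outside ASCII
    // (the pattern from the accepted answer).
    static String stripNonAscii(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    // Decomposes to NFD, then removes only combining marks
    // (Unicode general category M), leaving other characters alone.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String french = "\u00e9\u00e0\u00f9";   // e-acute, a-grave, u-grave
        String polish = "\u0141\u00f3d\u017a";  // L-stroke, o-acute, d, z-acute
        System.out.println(stripNonAscii(french));
        System.out.println(stripMarks(french));
        System.out.println(stripNonAscii(polish));
        System.out.println(stripMarks(polish));
    }
}
```

For "éàù" both methods return "eau". For "Łódź", `stripNonAscii` drops the "Ł" altogether and returns "odz", while `stripMarks` returns "Łodz": "Ł" has no canonical decomposition, so neither pattern turns it into "L".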

score 8 · Answer 2 · answered Nov 29 '13 at 15:59

I've adjusted Rabi's solution to my needs; I hope it helps someone:

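// Requires: import java.util.HashMap; import java.util.Map;
// Lookup table from accented character to its replacement, built on first use.
// Note: this lazy initialization is not thread-safe.
private static Map<Character, Character> MAP_NORM;
public static String removeAccents(String value)
{
    if (MAP_NORM == null || MAP_NORM.size() == 0)
    {
        MAP_NORM = new HashMap<Character, Character>();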
        MAP_NORM.put('À', 'A');
        MAP_NORM.put('Á', 'A');
        MAP_NORM.put('Â', 'A');
        MAP_NORM.put('Ã', 'A');
        MAP_NORM.put('Ä', 'A');
        MAP_NORM.put('È', 'E');
        MAP_NORM.put('É', 'E');
        MAP_NORM.put('Ê', 'E');
        MAP_NORM.put('Ë', 'E');
        MAP_NORM.put('Í', 'I');
        MAP_NORM.put('Ì', 'I');
        MAP_NORM.put('Î', 'I');
        MAP_NORM.put('Ï', 'I');
        MAP_NORM.put('Ù', 'U');
        MAP_NORM.put('Ú', 'U');
        MAP_NORM.put('Û', 'U');
        MAP_NORM.put('Ü', 'U');
        MAP_NORM.put('Ò', 'O');
        MAP_NORM.put('Ó', 'O');
        MAP_NORM.put('Ô', 'O');
        MAP_NORM.put('Õ', 'O');
        MAP_NORM.put('Ö', 'O');
        MAP_NORM.put('Ñ', 'N');
        MAP_NORM.put('Ç', 'C');
        MAP_NORM.put('ª', 'A');
        MAP_NORM.put('º', 'O');
        MAP_NORM.put('§', 'S');
        MAP_NORM.put('³', '3');
        MAP_NORM.put('²', '2');
        MAP_NORM.put('¹', '1');
        MAP_NORM.put('à', 'a');
        MAP_NORM.put('á', 'a');
        MAP_NORM.put('â', 'a');
        MAP_NORM.put('ã', 'a');
        MAP_NORM.put('ä', 'a');
        MAP_NORM.put('è', 'e');
        MAP_NORM.put('é', 'e');
        MAP_NORM.put('ê', 'e');
        MAP_NORM.put('ë', 'e');
        MAP_NORM.put('í', 'i');
        MAP_NORM.put('ì', 'i');
        MAP_NORM.put('î', 'i');
        MAP_NORM.put('ï', 'i');
        MAP_NORM.put('ù', 'u');
        MAP_NORM.put('ú', 'u');
        MAP_NORM.put('û', 'u');
        MAP_NORM.put('ü', 'u');
        MAP_NORM.put('ò', 'o');
        MAP_NORM.put('ó', 'o');
        MAP_NORM.put('ô', 'o');
        MAP_NORM.put('õ', 'o');
        MAP_NORM.put('ö', 'o');
        MAP_NORM.put('ñ', 'n');
        MAP_NORM.put('ç', 'c');
    }

    if (value == null) {
        return "";
    }

    StringBuilder sb = new StringBuilder(value);

    for(int i = 0; i < value.length(); i++) {
        Character c = MAP_NORM.get(sb.charAt(i));
        if(c != null) {
            sb.setCharAt(i, c.charValue());
        }
    }

    return sb.toString();

}

Here are more conversions for Polish diacritics: `MAP_NORM.put('Ą', 'A'); MAP_NORM.put('Ę', 'E'); MAP_NORM.put('Ć', 'C'); MAP_NORM.put('Ł', 'L'); MAP_NORM.put('Ń', 'N'); MAP_NORM.put('Ś', 'S'); MAP_NORM.put('Ź', 'Z'); MAP_NORM.put('Ż', 'Z'); MAP_NORM.put('ą', 'a'); MAP_NORM.put('ę', 'e'); MAP_NORM.put('ç', 'c'); MAP_NORM.put('ć', 'c'); MAP_NORM.put('ł', 'l'); MAP_NORM.put('ń', 'n'); MAP_NORM.put('ś', 's'); MAP_NORM.put('ź', 'z'); MAP_NORM.put('ż', 'z');` — Michael Osofsky, Jan 19 '21 at 23:01
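Where `Normalizer` is available (API level 9 or later, according to the comments on the accepted answer), the hand-written table can shrink to just the letters that do not decompose. The following is only a sketch under that assumption: the class name `AccentFolder`, the method `fold` and the choice of four extra letters are illustrative, not a complete list. The code is kept pure ASCII, with `\u` escapes for the special letters.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

final class AccentFolder {
    // Letters whose stroke is part of the letter itself: they have no
    // canonical decomposition, so Normalizer leaves them unchanged.
    private static final Map<Character, Character> NO_DECOMPOSITION = new HashMap<>();
    static {
        NO_DECOMPOSITION.put('\u0141', 'L'); // capital L with stroke
        NO_DECOMPOSITION.put('\u0142', 'l'); // small l with stroke
        NO_DECOMPOSITION.put('\u00d8', 'O'); // capital O with stroke
        NO_DECOMPOSITION.put('\u00f8', 'o'); // small o with stroke
    }

    static String fold(String s) {
        if (s == null) {
            return null;
        }
        // Split accented letters into base letter + combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            int type = Character.getType(c);
            if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK) {
                continue; // drop the accent itself
            }
            // Fall back to the small table for letters Normalizer cannot split.
            Character replacement = NO_DECOMPOSITION.get(c);
            sb.append(replacement != null ? replacement.charValue() : c);
        }
        return sb.toString();
    }
}
```

With this sketch, `AccentFolder.fold("Łódź")` yields "Lodz": the acute accents are removed by the decomposition step and the "Ł" by the fallback table. The static initializer also avoids the lazy `null` check, so the map is fully populated before any thread can read it.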

score 5 · Answer 3 · answered Mar 24 '12 at 07:13

This is probably not the most efficient solution but it will do the trick and it works in all Android versions:

private static Map<Character, Character> MAP_NORM;
static { // Greek characters normalization
    MAP_NORM = new HashMap<Character, Character>();
    MAP_NORM.put('ά', 'α');
    MAP_NORM.put('έ', 'ε');
    MAP_NORM.put('ί', 'ι');
    MAP_NORM.put('ό', 'ο');
    MAP_NORM.put('ύ', 'υ');
    MAP_NORM.put('ή', 'η');
    MAP_NORM.put('ς', 'σ');
    MAP_NORM.put('ώ', 'ω');
    MAP_NORM.put('Ά', 'α');
    MAP_NORM.put('Έ', 'ε');
    MAP_NORM.put('Ί', 'ι');
    MAP_NORM.put('Ό', 'ο');
    MAP_NORM.put('Ύ', 'υ');
    MAP_NORM.put('Ή', 'η');
    MAP_NORM.put('Ώ', 'ω');
}

public static String removeAccents(String s) {
    if (s == null) {
        return null;
    }
    StringBuilder sb = new StringBuilder(s);

    for(int i = 0; i < s.length(); i++) {
        Character c = MAP_NORM.get(sb.charAt(i));
        if(c != null) {
            sb.setCharAt(i, c.charValue());
        }
    }

    return sb.toString();
}
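For comparison, on API levels that do have `Normalizer`, a similar Greek folding could be sketched without a per-character table. This is an assumption-laden variant, not a drop-in replacement: the names `GreekSearchKey` and `fold` are invented, and unlike the map above it lower-cases every letter, not only the accented capitals. The final sigma still needs an explicit replacement because it is a letter variant, not an accent.

```java
import java.text.Normalizer;
import java.util.Locale;

final class GreekSearchKey {
    // Hypothetical helper that builds an accent-free, lower-case key
    // suitable for matching, not for display.
    static String fold(String s) {
        if (s == null) {
            return null;
        }
        // NFD turns an accented vowel into the plain vowel followed by a
        // combining mark, which the regex then removes.
        String stripped = Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Lower-case first, then map final sigma (U+03C2) to sigma (U+03C3).
        return stripped.toLowerCase(Locale.ROOT).replace('\u03c2', '\u03c3');
    }
}
```

The replacement runs after lower-casing on purpose, so that any final sigma present at that point is folded to the regular sigma.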

score 3 · Answer 4 · answered Aug 16 '15 at 00:07

While Guillaume's answer does work, it strips all non-ASCII characters from the string. If you wish to preserve these, try this code (where `string` is the string to simplify):

// Convert input string to decomposed Unicode (NFD) so that the
// diacritical marks used in many European scripts (such as the
// "C WITH CIRCUMFLEX" → ĉ) become separate characters.
// Also use compatibility decomposition (K) so that characters,
// that have the exact same meaning as one or more other
// characters (such as "㎏" → "kg" or "ﾋ" → "ヒ"), match when
// comparing them.
string = Normalizer.normalize(string, Normalizer.Form.NFKD);

StringBuilder result = new StringBuilder();

int offset = 0, strLen = string.length();
while(offset < strLen) {
    int character = string.codePointAt(offset);
    offset += Character.charCount(character);

    // Only process characters that are not combining Unicode
    // characters. This way all the decomposed diacritical marks
    // (and some other not-that-important modifiers), that were
    // part of the original string or produced by the NFKD
    // normalizer above, disappear.
    switch(Character.getType(character)) {
        case Character.NON_SPACING_MARK:
        case Character.COMBINING_SPACING_MARK:
            // Combining character found: skip it
            break;

        default:
            result.appendCodePoint(Character.toLowerCase(character));
            break;
    }
}

// Since we stripped all combining Unicode characters in the
// previous while-loop there should be no combining character
// remaining in the string and the composed and decomposed
// versions of the string should be equivalent. This also means
// we do not need to convert the string back to composed Unicode
// before returning it.
return result.toString();

Of course it does. If you want to keep Upper-Cases use `result.appendCodePoint(character);` instead of `result.appendCodePoint(Character.toLowerCase(character));`. ;-) — ntninja, Sep 10 '15 at 17:35
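The effect of choosing NFKD over NFD, as described in the code comments of this answer, can be checked with a short sketch. The class name `DecompositionForms` and its two helpers are made up for the example; the characters are written as `\u` escapes to keep the source pure ASCII.

```java
import java.text.Normalizer;

final class DecompositionForms {
    // Canonical decomposition only.
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Canonical plus compatibility decomposition.
    static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String cCircumflex = "\u0109";   // c with circumflex
        String squareKg = "\u338f";      // the single "kg" symbol
        String halfwidthHi = "\uff8b";   // halfwidth katakana "hi"

        // Both forms split the accented letter into 'c' + U+0302.
        System.out.println(nfd(cCircumflex).length());
        // Only NFKD rewrites compatibility characters.
        System.out.println(nfd(squareKg).equals(squareKg));
        System.out.println(nfkd(squareKg));
        System.out.println(nfkd(halfwidthHi).equals("\u30d2"));
    }
}
```

NFD leaves "㎏" and "ﾋ" untouched, while NFKD turns them into "kg" and the full-width "ヒ". Whether that extra folding is wanted depends on the use: it helps matching, but it changes text that contained no accents at all.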

score 0 · Answer 5 · answered Dec 15 '11 at 16:51

All accented characters lie outside the 7-bit ASCII range, so their character codes are greater than 127. You could therefore walk through the characters of the string and, whenever a character code is greater than 127, map it to the unaccented equivalent you want. There is no easy way to do that mapping without a normalizer, though: you would have to keep some sort of map in memory from the accented characters to their unaccented counterparts.
