How to convert accented characters in Java

Question

I'm using Java 1.5 and I need to normalize a String (for example àèìòù ---> aeiou). I can't use Normalizer because it is only available from Java 1.6 onwards. Any ideas?

I've tried this:

public String normalizeText(String text) {
    text = normalizer(text);
    text = text.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    return text;
}

public static String normalizer(String word) {
    try {
        int i;
        Class<?> normalizerClass = Class.forName("java.text.Normalizer");
        Class<?> normalizerFormClass = null;
        Class<?>[] nestedClasses = normalizerClass.getDeclaredClasses();
        for (i = 0; i < nestedClasses.length; i++) {
            Class<?> nestedClass = nestedClasses[i];
            if (nestedClass.getName().equals("java.text.Normalizer$Form")) {
                normalizerFormClass = nestedClass;
            }
        }
        assert normalizerFormClass.isEnum();
        Method methodNormalize = normalizerClass.getDeclaredMethod(
                "normalize",
                CharSequence.class,
                normalizerFormClass);
        Object nfcNormalization = null;
        Object[] constants = normalizerFormClass.getEnumConstants();
        for (i = 0; i < constants.length; i++) {
            Object constant = constants[i];
            if (constant.toString().equals("NFC")) {
                nfcNormalization = constant;
            }
        }
        return (String) methodNormalize.invoke(null, word, nfcNormalization);
    } catch (Exception ex) { return null; }
}

I haven't tested it, but perhaps [this answer](http://stackoverflow.com/a/10831704/1682559) might work. It states that it should work for pre Java 6. You do need to know the range of the characters you want to convert and their order though, as explained in the answer. — Kevin Cruijssen, Apr 14 '16 at 15:00
What a horrible piece of code... where did you get that from? It totally unnecessarily uses reflection, making the program an order of magnitude more complicated and inefficient than necessary. And it's not magically going to make the Java 6 class `java.text.Normalizer` work on Java 5. — Jesper, Apr 14 '16 at 15:02

score 1 · Accepted Answer · edited May 23 '17 at 10:28

Make your own method

If you cannot use Normalizer, another option is a Map that holds every accented character you want to normalize together with its replacement.

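import java.util.HashMap;
import java.util.Map;

// Java 5 has no diamond operator, so the type arguments are spelled out.
Map<Character, Character> rep = new HashMap<Character, Character>();
// Keys and values are chars (single quotes), matching the declared types.
rep.put('à', 'a');
rep.put('è', 'e');
rep.put('ì', 'i');
rep.put('ò', 'o');
rep.put('ù', 'u');
// etc...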

Writing every entry out in code is long and ugly, so loading the mappings from a text file is better.
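As a rough sketch of the file-based variant, the class below reads mappings from a Reader and applies them character by character. The class name `AccentMap`, the method names, and the one-mapping-per-line format (`à=a`) are all assumptions made for illustration, not part of any library. It sticks to Java 5 syntax and uses `\u` escapes so that the source file encoding does not matter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class AccentMap {

    // Hypothetical file format: one mapping per line, e.g. "à=a".
    // Lines that do not have exactly the form X=Y are ignored.
    public static Map<Character, Character> load(Reader in) throws IOException {
        Map<Character, Character> rep = new HashMap<Character, Character>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 3 && line.charAt(1) == '=') {
                rep.put(Character.valueOf(line.charAt(0)),
                        Character.valueOf(line.charAt(2)));
            }
        }
        return rep;
    }

    // Replaces every mapped character; unmapped characters are kept as-is.
    public static String normalize(String text, Map<Character, Character> rep) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = rep.get(Character.valueOf(c));
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the text file here: à, è, ì, ò, ù.
        String mappings = "\u00e0=a\n\u00e8=e\n\u00ec=i\n\u00f2=o\n\u00f9=u\n";
        Map<Character, Character> rep = load(new StringReader(mappings));
        System.out.println(normalize("\u00e0\u00e8\u00ec\u00f2\u00f9", rep)); // prints "aeiou"
    }
}
```

For a real file, the Reader could be created with `new InputStreamReader(new FileInputStream(path), "UTF-8")` so that the accented characters are decoded with a known charset rather than the platform default.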

Already existing answer

I found the following answer on this page. It works; I have tested it:

Mirror of the Unicode table from 00c0 to 017f, without diacritics.

private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

Returns the string without diacritics (7-bit approximation).

public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}
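To see how the lookup behaves at its edges, here is a small self-contained check. The table and method are copied unchanged from the answer; the wrapper class name `RemoveDiacriticDemo` and the sample inputs are assumptions chosen for illustration.

```java
public class RemoveDiacriticDemo {

    // Table copied verbatim from the answer: 12 rows of 16 entries,
    // one entry per code point from U+00C0 to U+017F.
    private static final String tab00c0 = "AAAAAAACEEEEIIII" +
        "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
        "aaaaaaaceeeeiiii" +
        "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
        "AaAaAaCcCcCcCcDd" +
        "DdEeEeEeEeEeGgGg" +
        "GgGgHhHhIiIiIiIi" +
        "IiJjJjKkkLlLlLlL" +
        "lLlNnNnNnnNnOoOo" +
        "OoOoRrRrRrSsSsSs" +
        "SsTtTtTtUuUuUuUu" +
        "UuUuWwYyYZzZzZzF";

    public static String removeDiacritic(String source) {
        char[] vysl = new char[source.length()];
        char one;
        for (int i = 0; i < source.length(); i++) {
            one = source.charAt(i);
            if (one >= '\u00c0' && one <= '\u017f') {
                one = tab00c0.charAt((int) one - '\u00c0');
            }
            vysl[i] = one;
        }
        return new String(vysl);
    }

    public static void main(String[] args) {
        // Precomposed letters inside the table range are replaced: à, è, ì, ò, ù.
        System.out.println(removeDiacritic("\u00e0\u00e8\u00ec\u00f2\u00f9")); // prints "aeiou"

        // A few entries map to themselves, e.g. U+00DF (sharp s) stays unchanged.
        System.out.println(removeDiacritic("\u00df").equals("\u00df")); // prints "true"

        // Decomposed input ('e' followed by combining acute U+0301) lies outside
        // the table range, so the combining mark is left in place.
        System.out.println(removeDiacritic("e\u0301").length()); // prints "2"
    }
}
```

Because the method only covers U+00C0 to U+017F, text that arrives in decomposed form, or that uses accented letters from other blocks, passes through untouched.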
