How can I use InCombiningDiacriticalMarks while ignoring one case?

Question

I'm writing code to remove all diacritics from a String.

For example: áÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑ

I'm using the Unicode block property InCombiningDiacriticalMarks, but I want to leave ñ and Ñ unchanged.

For now I save these two characters before the replacement with:

    s = s.replace('ñ', '\001');
    s = s.replace('Ñ', '\002');

Is it possible to use InCombiningDiacriticalMarks while ignoring the diacritic of ñ and Ñ?

This is my code:

import java.text.Normalizer;

public static String stripAccents(String s) 
{
    // Protect ñ and Ñ with placeholder control characters
    s = s.replace('ñ', '\001');
    s = s.replace('Ñ', '\002');
    // Decompose, then delete every combining diacritical mark
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    // Restore ñ and Ñ
    s = s.replace('\001', 'ñ');
    s = s.replace('\002', 'Ñ');

    return s;
}

It works fine, but I want to know whether this code can be optimized.

score 2 · Answer 1 · answered Mar 20 '20 at 06:32

It depends on what you mean by "optimize". It's hard to reduce the number of lines of code from what you have written, but since you are processing the string six times, there is scope to improve performance by processing the input string only once, character by character:

public class App {

    // See SO answer https://stackoverflow.com/a/10831704/2985643 by virgo47
    private static final String tab00c0
            = "AAAAAAACEEEEIIII"
            + "DNOOOOO\u00d7\u00d8UUUUYI\u00df"
            + "aaaaaaaceeeeiiii"
            + "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey"
            + "AaAaAaCcCcCcCcDd"
            + "DdEeEeEeEeEeGgGg"
            + "GgGgHhHhIiIiIiIi"
            + "IiJjJjKkkLlLlLlL"
            + "lLlNnNnNnnNnOoOo"
            + "OoOoRrRrRrSsSsSs"
            + "SsTtTtTtUuUuUuUu"
            + "UuUuWwYyYZzZzZzF";

    public static void main(String[] args) {
        var input = "AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ";
        var output = removeDiacritic(input);
        System.out.println("input  = " + input);
        System.out.println("output = " + output);
    }

    public static String removeDiacritic(String input) {
        var output = new StringBuilder(input.length());
        for (var c : input.toCharArray()) {
            if (isModifiable(c)) {
                c = tab00c0.charAt(c - '\u00c0');
            }
            output.append(c);
        }
        return output.toString();
    }

    // Returns true if the supplied char is a candidate for diacritic removal. 
    static boolean isModifiable(char c) {
        boolean modifiable;

        if (c < '\u00c0' || c > '\u017f') {
            modifiable = false;
        } else {
            modifiable = switch (c) {

                case 'ñ', 'Ñ' ->
                    false;
                default ->
                    true;
            };
        }
        return modifiable;
    }
}

This is the output from running the code:

input  = AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ
output = AaBbCcaAeEiIoOuUaAeEiIoOuUñÑcC

Characters without diacritics in the input string are not modified. Otherwise the diacritic is removed (e.g. Ç to C), except in the cases of ñ and Ñ.

Notes:

The code does not use the Normalizer class or InCombiningDiacriticalMarks at all. Instead it processes each character in the input string only once, removing its accent if appropriate. The conventional approach for removing diacritics (as used in the OP) does not support selective removal as far as I know.
The code is based on an answer by user virgo47, but enhanced to support the selective removal of accents. See virgo47's answer for details of mapping an accented character to its unaccented counterpart.
This solution only works for Latin-1/Latin-2 (the table covers the characters from U+00C0 to U+017F), but could be enhanced to support other mappings.
Your solution is very short and easy to understand, but it feels brittle, and for large input I suspect that it would be significantly slower than an approach that only processed each character once.

You don't need the `isModifiable` method. It's better to change the `tab00c0` string to preserve the 'Ñ' and 'ñ' characters: `private static final String tab00c0 = "AAAAAAACEEEEIIIIDÑOOOOO\u00d7\u00d8UUUUYI\u00dfaaaaaaaceeeeiiii\u00f0ñooooo\u00f7\u00f8uuuuy\u00feyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiJjJjKkkLlLlLlLlLlNnNnNnnNnOoOoOoOoRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzF";` — Charliemops, Sep 08 '21 at 15:22
@Charliemops I think you should post an answer. I'll be happy to upvote if it improves the OP's code. — skomisa, Sep 08 '21 at 16:34
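
For illustration, the table-only variant suggested in the comment above might look roughly like the sketch below. The class name `TableStripper` and the method name `removeDiacritics` are hypothetical, chosen only for this example. The table keeps the layout of `tab00c0`, except that the entries for Ñ (U+00D1) and ñ (U+00F1) map to themselves, so only the range check remains.

```java
// Hypothetical class name, used only for this sketch.
final class TableStripper {

    // One entry per code point from U+00C0 to U+017F (192 entries).
    // Same as tab00c0, except that U+00D1 and U+00F1 map to themselves.
    private static final String TABLE
            = "AAAAAAACEEEEIIII"
            + "D\u00d1OOOOO\u00d7\u00d8UUUUYI\u00df"
            + "aaaaaaaceeeeiiii"
            + "\u00f0\u00f1ooooo\u00f7\u00f8uuuuy\u00fey"
            + "AaAaAaCcCcCcCcDd"
            + "DdEeEeEeEeEeGgGg"
            + "GgGgHhHhIiIiIiIi"
            + "IiJjJjKkkLlLlLlL"
            + "lLlNnNnNnnNnOoOo"
            + "OoOoRrRrRrSsSsSs"
            + "SsTtTtTtUuUuUuUu"
            + "UuUuWwYyYZzZzZzF";

    static String removeDiacritics(String input) {
        StringBuilder output = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Only characters inside the table's range are looked up.
            if (c >= '\u00c0' && c <= '\u017f') {
                c = TABLE.charAt(c - '\u00c0');
            }
            output.append(c);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDiacritics("\u00e1\u00c1\u00f1\u00d1\u00e7\u00c7"));
    }
}
```

The trade-off is that the exception for ñ and Ñ is now baked into the table rather than stated in code, so adding or removing exceptions means editing the table itself.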

score 0 · Answer 2 · answered Oct 07 '22 at 06:22

Ave Maria Purisima,

You can create a pattern excluding the tilde from the diacritical marks set:

import java.text.Normalizer;
import java.util.regex.Pattern;

// Matches any combining diacritical mark except U+0303 (combining tilde)
private static final Pattern STRIP_ACCENTS_PATTERN = Pattern.compile("[\\p{InCombiningDiacriticalMarks}&&[^\u0303]]+");

public static String stripAccents(String input) {
    if (input == null) {
        return null;
    }
    final String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll("");
}
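
Two things are worth keeping in mind with this pattern. It keeps the tilde on every base letter, not only on n (so ã and õ also survive), and the returned string is still decomposed, meaning ñ comes back as 'n' followed by U+0303 rather than as the single character U+00F1. If the composed form is wanted, one possible adjustment is to normalize back to NFC at the end. The following is a minimal sketch under that assumption; the class name `TildeKeepingStripper` is hypothetical.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
final class TildeKeepingStripper {

    // Every mark in the Combining Diacritical Marks block except U+0303 (combining tilde).
    private static final Pattern MARKS_EXCEPT_TILDE =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}&&[^\\x{0303}]]+");

    static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove all combining marks except the tilde.
        String stripped = MARKS_EXCEPT_TILDE.matcher(decomposed).replaceAll("");
        // NFC joins 'n' + U+0303 back into the single character U+00F1.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("\u00e1\u00c1\u00e9\u00c9\u00f1\u00d1\u00e7\u00c7"));
    }
}
```

With this sketch, `stripAccents("áÁéÉñÑçÇ")` returns `aAeEñÑcC`, and the result for a lone ñ has length 1 instead of 2.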

Hope it helps
