
I'd like to normalize any extended ASCII characters, but exclude umlauts.

If I wanted to normalize umlauts as well, I would use:

Normalizer.normalize(value, Normalizer.Form.NFKD)
    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

But how can I exclude German umlauts?

As a result I would like to get:

source: üöäâÇæôøñÁ

desired result: üöäaCaeoonA or similar

Deduplicator
membersound
  • What do you mean by normalizing extended characters? – Bozho Jul 03 '14 at 08:02
  • `â -> a` for example. That's what the normalizer does. – membersound Jul 03 '14 at 08:12
  • Just an idea: replace the umlauts (and their decomposed forms!) first with something else not affected by normalization, do the normalization, and then replace them back. – ankon Jul 03 '14 at 08:49
  • What are you trying to achieve with this normalization? Is that the result you are expecting? Isn't U+00E2 when normalized actually U+0061 U+005E, not just U+0061 (and so on, for the other cases, too)? – Jere Käpyaho Jul 03 '14 at 09:23
  • Yes it's the desired result I'd like to get, and don't know how. – membersound Jul 03 '14 at 12:37
  • In your example, you've also changed æ to a, where usually it would be left alone or changed to ae. Would you do something similar with œ? And what about the letter þ (thorn)? Dealing with non-English characters can get tricky... – DPenner1 Jul 03 '14 at 15:38
  • I would also be fine with `ae`. I'd like to have a result that is as close as it can be, but I know there will always be some differences. That's ok. – membersound Jul 03 '14 at 19:01
  • I'm curious about your goal. What kind of texts would you be transforming? Wouldn't it make foreign words look like German words and cause confusion? – Tom Blodget Jul 04 '14 at 16:15
  • It's an adapter for a legacy application that is only used by German customers, and can therefore only display the "German" alphabet. Any other non-ASCII letters, such as French, Turkish, or Scandinavian ones, will cause errors. – membersound Jul 05 '14 at 08:16
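
ankon's suggestion above (swap the umlauts for placeholders, normalize, swap them back) might look roughly like the sketch below. The class and method names are made up for illustration, and it assumes the input never contains characters from the Unicode Private Use Area, which are used here as placeholders.

```java
import java.text.Normalizer;

public class UmlautPlaceholderDemo {

    // German umlauts to keep, written as escapes: a-, o-, u-umlaut in lower and upper case.
    private static final String UMLAUTS = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc";

    // Start of the Private Use Area; ASSUMPTION: these code points never occur in the input.
    private static final char PLACEHOLDER_BASE = '\uE000';

    public static String stripDiacriticsKeepUmlauts(String value) {
        // Compose first, so that "u" + combining diaeresis becomes a single umlaut character.
        String s = Normalizer.normalize(value, Normalizer.Form.NFC);

        // 1. Swap each umlaut for a placeholder that normalization leaves alone.
        for (int i = 0; i < UMLAUTS.length(); i++) {
            s = s.replace(UMLAUTS.charAt(i), (char) (PLACEHOLDER_BASE + i));
        }

        // 2. Decompose and drop the combining marks, as in the question.
        s = Normalizer.normalize(s, Normalizer.Form.NFKD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

        // 3. Swap the placeholders back to the original umlauts.
        for (int i = 0; i < UMLAUTS.length(); i++) {
            s = s.replace((char) (PLACEHOLDER_BASE + i), UMLAUTS.charAt(i));
        }
        return s;
    }

    public static void main(String[] args) {
        // Sample string from the question, written with Unicode escapes.
        String sample = "\u00fc\u00f6\u00e4\u00e2\u00c7\u00e6\u00f4\u00f8\u00f1\u00c1";
        System.out.println(stripDiacriticsKeepUmlauts(sample));
    }
}
```

Note that æ and ø pass through this unchanged, because neither has a Unicode decomposition; letters like these would still need an explicit mapping.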

2 Answers


From here I see two solutions: the first one is quite dirty, and the second is quite tedious to implement, I guess.

Community
alain.janinm
  • A third alternative is to split the input string on umlauted characters, normalize the parts in between, and join it back. – Jongware Jul 03 '14 at 19:19
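
Jongware's split-and-join idea could be sketched along these lines. This is only an illustration under assumptions: the class and method names are hypothetical, and only the six German umlauts are treated as separators.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UmlautSplitDemo {

    // Matches exactly one German umlaut (a-, o-, u-umlaut, lower or upper case).
    private static final Pattern UMLAUT =
            Pattern.compile("[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc]");

    // The normalization from the question: decompose, then drop combining marks.
    private static String stripDiacritics(String part) {
        return Normalizer.normalize(part, Normalizer.Form.NFKD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static String normalizeAroundUmlauts(String value) {
        // Compose first, so decomposed umlauts are recognized by the pattern.
        String s = Normalizer.normalize(value, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder();
        Matcher m = UMLAUT.matcher(s);
        int last = 0;
        while (m.find()) {
            // Normalize the text between the previous umlaut and this one.
            sb.append(stripDiacritics(s.substring(last, m.start())));
            // Copy the umlaut itself through untouched.
            sb.append(m.group());
            last = m.end();
        }
        // Normalize whatever follows the last umlaut.
        sb.append(stripDiacritics(s.substring(last)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeAroundUmlauts("\u00fc\u00f6\u00e4\u00e2\u00c7\u00f4\u00f1\u00c1"));
    }
}
```

Compared with a placeholder swap, this needs no reserved characters, but it shares the same limit: letters without a Unicode decomposition, such as æ and ø, are left as they are.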
import java.text.Normalizer;
import java.util.HashMap;

// Latin to ASCII - mostly
private static final String TAB_00C0 = "" +
        "AAAAÄAACEEEEIIII" +
        "DNOOOOÖ×OUUUÜYTß" +
        "aaaaäaaceeeeiiii" +
        "dnooooö÷ouuuüyty" +
        "AaAaAaCcCcCcCcDd" +
        "DdEeEeEeEeEeGgGg" +
        "GgGgHhHhIiIiIiIi" +
        "IiJjJjKkkLlLlLlL" +
        "lLlNnNnNnnNnOoOo" +
        "OoOoRrRrRrSsSsSs" +
        "SsTtTtTtUuUuUuUu" +
        "UuUuWwYyYZzZzZzs";

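// Keys are char literals so that they autobox to Character, matching the map's key type.
private static final HashMap<Character, String> LIGATURES = new HashMap<Character, String>() {{
    put('æ', "ae");
    put('œ', "oe");
    put('þ', "th");
    put('\u0133', "ij"); // U+0133, the lowercase ij ligature
    put('ð', "dh");
    put('Æ', "AE");
    put('Œ', "OE");
    put('Þ', "TH");
    put('Ð', "DH");
    put('\u0132', "IJ"); // U+0132, the uppercase IJ ligature
    //TODO
}};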

public static String removeAllButUmlauts(String value) {
    value = Normalizer.normalize(value, Normalizer.Form.NFC);
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        String l = LIGATURES.get(c);
        if (l != null){
            sb.append(l);
        } else if (c < 0xc0) {
            sb.append(c); // ASCII, C1 control codes, and Latin-1 symbols (U+00A0..U+00BF)
        } else if (c >= 0xc0 && c <= 0x17f) {
            c = TAB_00C0.charAt(c - 0xc0); // common single latin letters
            sb.append(c);
        } else { 
            // anything else, including Vietnamese and rare diacritics
            l = Normalizer.normalize(Character.toString(c), Normalizer.Form.NFKD)
                    .replaceAll("[\\p{InCombiningDiacriticalMarks}]+", "");
            sb.append(l);
        }

    }
    return sb.toString();
}

and then

String value = "üöäâÇæôøñÁ";
String after = removeAllButUmlauts(value);
System.out.println(after);

gives:

üöäaCaeoonA
Karol S
  • The function should be named `removeAllButGermanUmlauts`. It's hard to see if that's what it does but that's what the question asks for. German umlauts are "üöä" (upper and lower case). – Tom Blodget Jul 04 '14 at 16:32