Why don't some diacritics get stripped?

Question

I am using the method from this answer to remove special characters from words and change them to a simple form. This works pretty nicely for many basic accents, e.g.

Malmö becomes "Malmo"
München becomes "Munchen"
Åge becomes "Age"

However this doesn't work on some other characters, for example:

Strømsgodset remains "Strømsgodset"
Kulħadd remains "Kulħadd"

Is there any reason why these characters are not converted like the others?

Also is there any way to similarly convert 'combined' characters such as:

æ -> ae
ẞ -> ss

Because linguists and bureaucrats at the Unicode Consortium decided so. – xanatos May 08 '15 at 13:30
Regarding your second question (how to map them to a pair of other characters): use a `Dictionary`. Then it's easy: `foreach (var kv in dict) text = text.Replace(kv.Key.ToString(), kv.Value);` – Tim Schmelter May 08 '15 at 13:36
That will work if you know all the special characters in every language in the world. – Gigi May 08 '15 at 13:44
@Gigi There are no special characters. What you said is like saying that sushi is special food. – xanatos May 08 '15 at 13:56
What is your final goal? Why do you need to convert letters to the form you call simple? BTW, Kullħadd has one more `L`. – Dialecticus May 08 '15 at 14:31
To match search results regardless of diacritics. Kullħadd is the newspaper, Kulħadd means "everyone" (I'm Maltese :)). – Gigi May 08 '15 at 14:52
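
The `Dictionary` approach suggested in the comments could look roughly like the sketch below. The class name `LetterFolding`, the method `Fold`, and the four table entries are illustrative assumptions, not a complete or authoritative mapping; as noted in the comments, a real table would have to list every letter you care about, including uppercase forms.

```csharp
using System;
using System.Collections.Generic;

static class LetterFolding
{
    // Illustrative entries only (an assumption, not a complete table).
    static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "æ", "ae" },
        { "ø", "o" },
        { "ħ", "h" },
        { "ß", "ss" },
    };

    public static string Fold(string text)
    {
        foreach (var kv in Map)
        {
            // string.Replace(string, string) does an ordinal, case-sensitive replace.
            text = text.Replace(kv.Key, kv.Value);
        }
        return text;
    }

    static void Main()
    {
        Console.WriteLine(Fold("Strømsgodset"));
        Console.WriteLine(Fold("Kulħadd"));
    }
}
```

With this table, `Fold("Strømsgodset")` yields `Stromsgodset`. Letters that are not in the table pass through unchanged, so this complements rather than replaces the normalization step.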

xanatos · Accepted Answer · 2015-05-08T15:09:17.703

3

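Because the normalization chart written by the Unicode Consortium doesn't have the decompositions you want: ö decomposes into o plus a combining diaeresis, but ø, ħ and æ are single letters with no canonical decomposition, so there is no combining mark to strip. Microsoft used that chart (or more probably a text version of it, or perhaps an older version, but these are details).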

I don't know the reason, because I'm not a linguist, but I do hope that there are enough good linguists in the Unicode Consortium to make the correct choices.
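
The stripping method linked in the question is not reproduced here. A common version, assumed in this sketch, decomposes the string with `NormalizationForm.FormD` and drops every `UnicodeCategory.NonSpacingMark` character; `NormalizationDemo` and `StripCombiningMarks` are hypothetical names. It shows why ö is simplified while ø is left alone:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

static class NormalizationDemo
{
    // Assumed stripping method: decompose (NFD), drop combining marks, recompose (NFC).
    public static string StripCombiningMarks(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        char[] kept = decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        return new string(kept).Normalize(NormalizationForm.FormC);
    }

    static void Main()
    {
        // "\u00F6" is ö, "\u00F8" is ø.
        Console.WriteLine("\u00F6".Normalize(NormalizationForm.FormD).Length); // 2: 'o' + U+0308
        Console.WriteLine("\u00F8".Normalize(NormalizationForm.FormD).Length); // 1: no decomposition

        Console.WriteLine(StripCombiningMarks("Malm\u00F6") == "Malmo");                    // True
        Console.WriteLine(StripCombiningMarks("Str\u00F8msgodset") == "Str\u00F8msgodset"); // True
    }
}
```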

Note that collation tables are separate from normalization tables, so you can have this:

int res = string.Compare("æ", "ae", CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace);

This returns 0, so for comparison purposes æ == ae, and ħ == h.

You can even use `IndexOf` with the collation:

int ix = CultureInfo.CurrentCulture.CompareInfo.IndexOf(
    "Ad aeternitatem", 
    "æter", 
    CompareOptions.IgnoreNonSpace); // 3

and ignoring case:

int ix = CultureInfo.CurrentCulture.CompareInfo.IndexOf(
    "Ad Aeternitatem", 
    "æter", 
    CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase); // 3
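
For the search use case mentioned in the comments, the two calls above could be wrapped in a small helper. This is only a sketch: `DiacriticInsensitiveSearch` and `ContainsLoosely` are hypothetical names, and using `CultureInfo.InvariantCulture` instead of `CurrentCulture` is an assumption made so that results do not vary with the user's locale. Which letters compare as equal still depends on the collation data of the platform and .NET version, so the example deliberately states no expected output for the `æ` case.

```csharp
using System;
using System.Globalization;

static class DiacriticInsensitiveSearch
{
    // Returns true when needle occurs in haystack, ignoring case and
    // non-spacing (diacritic) differences according to the collation tables.
    public static bool ContainsLoosely(string haystack, string needle)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        int index = compareInfo.IndexOf(
            haystack,
            needle,
            CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);
        return index >= 0;
    }

    static void Main()
    {
        Console.WriteLine(ContainsLoosely("Ad Aeternitatem", "aeter"));
        // "\u00E6" is æ; whether this matches depends on the collation data in use.
        Console.WriteLine(ContainsLoosely("Ad Aeternitatem", "\u00E6ter"));
    }
}
```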

edited May 08 '15 at 15:09

answered May 08 '15 at 13:34

xanatos


For these 'double characters' that makes sense, but does this also hold true for the others (e.g. ħ)? – Gigi May 08 '15 at 13:42
@gigi The fact that it is graphically similar to an `h` doesn't mean that it is an `h`. Would you like the `$` symbol to be decomposed to an `S` plus a `|`? :-) – xanatos May 08 '15 at 13:51
Yes? :) Kidding apart, there are practical reasons why it is useful to take advantage of the graphical (but not semantic) similarity, e.g. search. – Gigi May 08 '15 at 14:29
@gigi For search there is `CompareInfo.IndexOf`. If you want, I can post an example. It uses collation, not normalization. – xanatos May 08 '15 at 14:32
By all means, please do :) – Gigi May 08 '15 at 14:53
