How to put strings into culture-invariant buckets?

Question

My app needs to deal with strings that can contain accented characters. I need to be able to group those names into buckets for the different letters of the alphabet.

I had originally thought of using CultureInvariant string comparison in .NET, but there are two problems with this:

It won't actually treat the letter C as equal to Ç (C cedilla), and I need that equality.
WinRT's version of .NET doesn't have CultureInvariant as an option anywhere.

Can anyone suggest an algorithm or at least a starting point that I could use to try and group the different letters together?

Thanks.

To clarify, do you mean that you would put `é` and `e` in the same bucket? If so, [this post](http://stackoverflow.com/a/249126/187697) might be a starting point. — keyboardP, Jun 30 '13 at 17:00
Yes, that is what I mean but, as I've pointed out below, WinRT doesn't support Normalize. I think I have found a post on StackOverflow that is the same question, and has an answer, so I'll mark that as the answer to this question. — Philip Colmer, Jun 30 '13 at 17:47

score 0 · Answer 1 · edited Oct 08 '14 at 17:19

There is a piece of code (written by Michael S. Kaplan and referenced in quite a few posts) that does the trick in most situations:

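using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn)
{
    // FormD splits each accented letter into a base letter plus combining marks.
    string stFormD = stIn.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        // Keep everything except the combining marks (the accents themselves).
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        }
    }

    // Recompose whatever is left into the usual precomposed form.
    return (sb.ToString().Normalize(NormalizationForm.FormC));
}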

I tested it with Ç/C and with accented and unaccented letters, and it works fine (even with apostrophes). In any case, you might have to complement it with a dictionary-based approach or with a set of conditions (switch...case) to cover characters that do not decompose into a base letter plus an accent. For example:

if (inputString.Contains("ß"))
{
     inputString = inputString.Replace("ß", "ss");
}
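
A minimal sketch of how the two ideas could be combined to get the buckets the question asks for. It assumes `String.Normalize` is available (so it does not apply to WinRT as-is), and the names `NameBuckets`, `SpecialCases`, `Fold`, `BucketKey` and `Group` are hypothetical, introduced only for illustration. The special-case table is deliberately incomplete.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

static class NameBuckets
{
    // Hypothetical table of characters that FormD does not reduce to a base
    // letter plus a combining mark. Extend it for the data you expect.
    static readonly Dictionary<char, string> SpecialCases = new Dictionary<char, string>
    {
        { 'ß', "ss" }, { 'Æ', "AE" }, { 'æ', "ae" },
        { 'Ø', "O" },  { 'ø', "o" },
        { 'Đ', "D" },  { 'đ', "d" },
        { 'Ł', "L" },  { 'ł', "l" }
    };

    // Strips combining marks, then applies the special-case table.
    public static string Fold(string input)
    {
        var sb = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue; // drop the accent itself
            }

            string replacement;
            if (SpecialCases.TryGetValue(c, out replacement))
            {
                sb.Append(replacement);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Bucket key: the upper-cased first letter after folding,
    // or "#" for anything that does not start with A-Z.
    public static string BucketKey(string name)
    {
        string folded = Fold(name ?? string.Empty).Trim();
        if (folded.Length == 0)
        {
            return "#";
        }
        char first = char.ToUpperInvariant(folded[0]);
        return (first >= 'A' && first <= 'Z') ? first.ToString() : "#";
    }

    // Groups names by bucket key, keeping the original spelling of each name.
    public static ILookup<string, string> Group(IEnumerable<string> names)
    {
        return names.ToLookup(n => BucketKey(n));
    }
}
```

With this sketch, `NameBuckets.Group(new[] { "Çelik", "Carl", "Émile" })` would put the first two names under "C" and the third under "E". Sending non-letters to a single "#" bucket is a design choice made here, not something the question requires.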

Thanks for this. Unfortunately, WinRT doesn't have Normalize. However, http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net, referenced from http://stackoverflow.com/questions/12334314/is-there-string-normalize-alternative-in-winrt, suggests using the Encoding library which **is** in WinRT and I just need to test that now. — Philip Colmer, Jun 30 '13 at 17:46
Sorry about that. If you don't mind, I will leave this answer here anyway in case someone finds it useful. — varocarbas, Jun 30 '13 at 17:51

score 0 · Answer 2 · edited May 23 '17 at 12:05

The post "Is there String.Normalize() alternative in WinRT?" (http://stackoverflow.com/questions/12334314/is-there-string-normalize-alternative-in-winrt) has a solution that has been marked as the answer. I haven't tested it yet, but I will comment here when I have.

answered Jun 30 '13 at 17:48 by Philip Colmer

Hmm. Unfortunately, I can't use this solution on Windows Phone because that doesn't support encoding ISO-8859-8. – Philip Colmer Jun 30 '13 at 18:54
I've found another StackOverflow question that is specifically about Windows Phone (http://stackoverflow.com/questions/13262845/how-to-remove-accent-from-string-in-wp7) which leads to http://invokeit.wordpress.com/2011/10/06/how-to-remove-diatrics-accent-marks-in-windows-phone-7-x/. Not exactly elegant but it might just do the job. – Philip Colmer Jun 30 '13 at 18:59
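
For platforms that have neither `String.Normalize` nor the ISO-8859-8 encoding, a plain lookup table is one possible fallback. The sketch below is not the code from the linked blog post; the class `ManualFolding`, its `RemoveAccents` method and the contents of the table are assumptions made for illustration, and the table covers only a handful of Latin-1 letters. It also assumes the input uses precomposed characters; accents stored as separate combining marks would pass through unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class ManualFolding
{
    // Illustrative, deliberately incomplete mapping from accented
    // characters to their plain replacements.
    static readonly Dictionary<char, string> Map = BuildMap();

    static Dictionary<char, string> BuildMap()
    {
        var map = new Dictionary<char, string>();
        Add(map, "ÀÁÂÃÄÅ", "A"); Add(map, "àáâãäå", "a");
        Add(map, "Ç", "C");      Add(map, "ç", "c");
        Add(map, "ÈÉÊË", "E");   Add(map, "èéêë", "e");
        Add(map, "ÌÍÎÏ", "I");   Add(map, "ìíîï", "i");
        Add(map, "Ñ", "N");      Add(map, "ñ", "n");
        Add(map, "ÒÓÔÕÖØ", "O"); Add(map, "òóôõöø", "o");
        Add(map, "ÙÚÛÜ", "U");   Add(map, "ùúûü", "u");
        Add(map, "Ý", "Y");      Add(map, "ýÿ", "y");
        Add(map, "ß", "ss");
        return map;
    }

    // Registers every character of 'accented' as mapping to 'plain'.
    static void Add(Dictionary<char, string> map, string accented, string plain)
    {
        foreach (char c in accented)
        {
            map[c] = plain;
        }
    }

    // Replaces each mapped character; everything else is copied as-is.
    public static string RemoveAccents(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            string plain;
            if (Map.TryGetValue(c, out plain))
            {
                sb.Append(plain);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

The trade-off is the one already noted above: it is not elegant, and any character missing from the table is silently left alone, so the table has to be grown to match the names the app actually sees.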
