2

My app needs to deal with names that can contain accented characters. I need to group those names into buckets, one for each letter of the alphabet.

I had originally thought of using InvariantCulture string comparison in .NET, but there are two problems with this:

  1. It won't actually say that the letter C is the same as C cedilla, but I need that equality.

  2. WinRT's version of .NET doesn't have InvariantCulture as an option anywhere.

Can anyone suggest an algorithm or at least a starting point that I could use to try and group the different letters together?

Thanks.

Philip Colmer
  • To clarify, do you mean that you would put `é` and `e` in the same bucket? If so, [this post](http://stackoverflow.com/a/249126/187697) might be a starting point. – keyboardP Jun 30 '13 at 17:00
  • Yes, that is what I mean but, as I've pointed out below, WinRT doesn't support Normalize. I think I have found a post on StackOverflow that is the same question, and has an answer, so I'll mark that as the answer to this question. – Philip Colmer Jun 30 '13 at 17:47

2 Answers

0

There is a code snippet (written by Michael S. Kaplan and referenced in quite a few posts) that does the trick in most situations:

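using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn)
{
    // Decompose each accented character into a base letter plus combining marks (form D).
    string stFormD = stIn.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();

    for (int ich = 0; ich < stFormD.Length; ich++)
    {
        // Keep every character except the combining (non-spacing) marks.
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[ich]);
        }
    }

    // Recompose whatever is left into form C.
    return sb.ToString().Normalize(NormalizationForm.FormC);
}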

I tested it with Ç/C and with letters with and without accents, and it works fine (even with apostrophes). You might still have to complement it with a dictionary-based approach, or with a set of conditions (if or switch...case), to cover characters that don't decompose into a base letter plus a combining mark. For example:

// Replace leaves the string unchanged when "ß" is absent, so no Contains check is needed.
inputString = inputString.Replace("ß", "ss");
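To make the dictionary-based idea a bit more concrete, here is a minimal sketch that avoids `Normalize` altogether and goes straight to the bucket letter. Everything in it is an assumption for illustration: `LetterBuckets`, `BucketFor` and `Group` are made-up names, the lookup table is deliberately incomplete, and I'm assuming `char.ToUpperInvariant` and LINQ are available on your target profile.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of any library: folds accented letters with a
// hand-written lookup table instead of String.Normalize.
static class LetterBuckets
{
    // Illustrative and incomplete; extend it for the languages you need.
    static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'À', "A" }, { 'Á', "A" }, { 'Â', "A" }, { 'Ä', "A" },
        { 'Ç', "C" },
        { 'È', "E" }, { 'É', "E" }, { 'Ê', "E" }, { 'Ë', "E" },
        { 'Ñ', "N" },
        { 'Ö', "O" }, { 'Ü', "U" },
        { 'ß', "SS" },
    };

    // Returns the bucket key for a name: "A".."Z", or "#" for anything else.
    public static string BucketFor(string name)
    {
        if (string.IsNullOrEmpty(name)) return "#";

        // Upper-case first so the table only needs upper-case entries.
        char first = char.ToUpperInvariant(name[0]);

        string folded;
        if (!Map.TryGetValue(first, out folded)) folded = first.ToString();

        char key = folded[0];
        return (key >= 'A' && key <= 'Z') ? key.ToString() : "#";
    }

    // Groups names by bucket key.
    public static Dictionary<string, List<string>> Group(IEnumerable<string> names)
    {
        return names.GroupBy(BucketFor).ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

With this table, `LetterBuckets.Group(new[] { "Carl", "Çelik", "Émile" })` puts Carl and Çelik together under "C" and Émile under "E". Any letter missing from the table that is not plain A-Z ends up in the "#" bucket, so the table has to cover every accented letter you expect to see.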
varocarbas
  • Thanks for this. Unfortunately, WinRT doesn't have Normalize. However, http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net, referenced from http://stackoverflow.com/questions/12334314/is-there-string-normalize-alternative-in-winrt, suggests using the Encoding library which **is** in WinRT and I just need to test that now. – Philip Colmer Jun 30 '13 at 17:46
  • Sorry about that. If you don't mind, I will leave this answer up anyway in case someone finds it useful. – varocarbas Jun 30 '13 at 17:51
0

This post, [Is there String.Normalize() alternative in WinRT?](http://stackoverflow.com/questions/12334314/is-there-string-normalize-alternative-in-winrt), has a solution that has been marked as the answer. I haven't tested it but will comment here when I have.

Philip Colmer
  • Hmm. Unfortunately, I can't use this solution on Windows Phone because that doesn't support encoding ISO-8859-8. – Philip Colmer Jun 30 '13 at 18:54
  • I've found another StackOverflow question that is specifically about Windows Phone (http://stackoverflow.com/questions/13262845/how-to-remove-accent-from-string-in-wp7) which leads to http://invokeit.wordpress.com/2011/10/06/how-to-remove-diatrics-accent-marks-in-windows-phone-7-x/. Not exactly elegant but it might just do the job. – Philip Colmer Jun 30 '13 at 18:59