How can I remove accents on a string?

Question

Possible Duplicate:
How do I remove diacritics (accents) from a string in .NET?

I have the following string

áéíóú

which I need to convert to

aeiou

How can I achieve this? (I don't need to compare strings; I need to save the new string.)

Not a duplicate of How do I remove diacritics (accents) from a string in .NET?. The accepted answer there doesn't explain anything and that's why I've "reopened" it.

Damn, want to rescind my reopen - it's definitely a duplicate. @BrunoLM if you don't like the answer it's better to put a bounty on it than ask a dup – Ruben Bartelink, May 15 '13 at 08:15

Jon Hanna · Accepted Answer · 2010-09-22T15:34:13.573

It depends on requirements. For most uses, normalising to NFD and then filtering out all combining characters will do. In some cases, normalising to NFKD is more appropriate (if you also want to remove some further distinctions between characters).

Some other distinctions will not be caught by this, notably stroked Latin characters. There's also no clear non-locale-specific way for some (should ł be considered equivalent to l or w?) so you may need to customise beyond this.

There are also some cases where NFD and NFKD don't work quite as expected, to allow for consistency between Unicode versions.

Hence:

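using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static IEnumerable<char> RemoveDiacriticsEnum(string src, bool compatNorm, Func<char, char> customFolding)
{
  // Decompose first so that each accent becomes a separate combining character.
  foreach(char c in src.Normalize(compatNorm ? NormalizationForm.FormKD : NormalizationForm.FormD))
  {
    switch(CharUnicodeInfo.GetUnicodeCategory(c))
    {
      case UnicodeCategory.NonSpacingMark:
      case UnicodeCategory.SpacingCombiningMark:
      case UnicodeCategory.EnclosingMark:
        // Combining mark: drop it.
        break;
      default:
        // Anything else passes through the caller's folding function.
        yield return customFolding(c);
        break;
    }
  }
}
public static IEnumerable<char> RemoveDiacriticsEnum(string src, bool compatNorm)
{
  // Call the enumerating overload, not the string-building one.
  return RemoveDiacriticsEnum(src, compatNorm, c => c);
}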
public static string RemoveDiacritics(string src, bool compatNorm, Func<char, char> customFolding)
{
  StringBuilder sb = new StringBuilder();
  foreach(char c in RemoveDiacriticsEnum(src, compatNorm, customFolding))
    sb.Append(c);
  return sb.ToString();
}
public static string RemoveDiacritics(string src, bool compatNorm)
{
  return RemoveDiacritics(src, compatNorm, c => c);
}

Here we have a default for the problem cases mentioned above, which just ignores them. We've also split building a string from generating the enumeration of characters, so we need not be wasteful in cases where there's no need for string manipulation on the result (say we were going to write the chars to output next, or do some further char-by-char manipulation).

For example, if we also wanted to convert ł and Ł to l and L, but had no other specialised concerns, we could use:

private static char NormaliseLWithStroke(char c)
{
  switch(c)
  {
     case 'ł':
       return 'l';
     case 'Ł':
       return 'L';
     default:
       return c;
  }
}

Passing this as customFolding to the methods above removes the stroke in this case, along with the decomposable diacritics.
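As a rough sketch of how the pieces fit together, the sample below restates the string-building overload in compact form so that it compiles on its own, then calls it with and without the stroke folding. The class name `DiacriticsDemo` and the sample word are assumptions made for illustration; they are not part of the answer.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Compact restatement of the approach above (hypothetical wrapper class).
    public static string RemoveDiacritics(string src, bool compatNorm, Func<char, char> customFolding)
    {
        NormalizationForm form = compatNorm ? NormalizationForm.FormKD : NormalizationForm.FormD;
        var sb = new StringBuilder(src.Length);
        foreach (char c in src.Normalize(form))
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.NonSpacingMark:
                case UnicodeCategory.SpacingCombiningMark:
                case UnicodeCategory.EnclosingMark:
                    break; // combining mark: drop it
                default:
                    sb.Append(customFolding(c));
                    break;
            }
        }
        return sb.ToString();
    }

    // Same folding as NormaliseLWithStroke above.
    public static char NormaliseLWithStroke(char c)
    {
        switch (c)
        {
            case 'ł': return 'l';
            case 'Ł': return 'L';
            default: return c;
        }
    }

    public static void Main()
    {
        // The question's input: every accent decomposes and is filtered out.
        string plain = RemoveDiacritics("áéíóú", false, c => c);                // "aeiou"

        // Ł has no decomposition, so the identity folding leaves it alone.
        string strokeKept = RemoveDiacritics("Łódź", false, c => c);            // "Łodz"

        // The custom folding catches the stroked letter as well.
        string strokeGone = RemoveDiacritics("Łódź", false, NormaliseLWithStroke); // "Lodz"

        Console.WriteLine(plain);
        Console.WriteLine(strokeKept);
        Console.WriteLine(strokeGone);
    }
}
```

The second call shows why the folding hook exists: normalisation alone does nothing to a stroked letter, because the stroke is not a separate combining character.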

There are some syntax problems, could you fix them? Your answer works and is very enlightening. Thank you. – BrunoLM, Sep 22 '10 at 15:24
Right you are Bruno, a few bits wrong due to writing straight as a reply rather than copying from a code editor. Should be correct now. – Jon Hanna, Sep 22 '10 at 15:35
+1 It seems to work but I don't follow. Would you explain customFolding? – paparazzo, Sep 03 '12 at 18:54
@Blam It's to catch cases that the basic approach doesn't cater for. Of the examples given, the `c => c` lambda just ignores the issue, while `NormaliseLWithStroke` removes the stroke from stroked L without dealing with any other cases. If you'll never use it, you could replace `yield return customFolding(c);` with just `yield return c;` and gain a performance boost. On the other hand, normalising back to NFC is probably a good idea in terms of how it'll deal with Korean Hangul. – Jon Hanna, Sep 03 '12 at 19:27
This does more than just remove diacritics. Foreach expands ¼ to 1/4. It expands 556 different characters when I tested with FormKD. – paparazzo, Apr 16 '13 at 22:31
Any chance of you moving this answer to the cited dup of the question? – Ruben Bartelink, May 15 '13 at 08:16
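Two points from the comments above can be checked directly: compatibility normalisation (FormKD) expands characters such as ¼, and canonical decomposition splits a Hangul syllable into jamo that only recombine when the result is normalised back to NFC. The following is a minimal sketch; the class and method names are assumptions made for illustration.

```csharp
using System;
using System.Text;

public static class NormalizationFormDemo
{
    // Hypothetical helper: number of UTF-16 chars after normalising to 'form'.
    public static int LengthIn(string s, NormalizationForm form)
    {
        return s.Normalize(form).Length;
    }

    public static void Main()
    {
        string quarter = "\u00BC"; // VULGAR FRACTION ONE QUARTER

        // Canonical decomposition leaves the fraction as a single char.
        Console.WriteLine(LengthIn(quarter, NormalizationForm.FormD));  // 1

        // Compatibility decomposition yields '1', FRACTION SLASH (U+2044), '4'.
        Console.WriteLine(LengthIn(quarter, NormalizationForm.FormKD)); // 3

        string syllable = "\uD55C"; // one precomposed Hangul syllable

        // NFD splits the syllable into three jamo; they are letters, not
        // combining marks, so a diacritic filter would leave them decomposed.
        string decomposed = syllable.Normalize(NormalizationForm.FormD);
        Console.WriteLine(decomposed.Length); // 3

        // Normalising back to NFC restores the single precomposed syllable.
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == syllable); // True
    }
}
```

Note that the slash produced for ¼ is U+2044 FRACTION SLASH rather than the ASCII `/`, which matters if the output is later compared against plain ASCII text.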

score 18 · Answer 2 · answered Sep 22 '10 at 13:04

public string RemoveDiacritics(string input)
{
    string stFormD = input.Normalize(NormalizationForm.FormD);
    int len = stFormD.Length;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < len; i++)
    {
        System.Globalization.UnicodeCategory uc = System.Globalization.CharUnicodeInfo.GetUnicodeCategory(stFormD[i]);
        if (uc != System.Globalization.UnicodeCategory.NonSpacingMark)
        {
            sb.Append(stFormD[i]);
        }
    }
    return (sb.ToString().Normalize(NormalizationForm.FormC));
}

cichy

Why allow SpacingCombiningMark and EnclosingMark? – Jon Hanna Sep 22 '10 at 14:01
As answered above by Karaszi, it's only an example of how it can be done. Bruno didn't specify exact requirements. – cichy Sep 22 '10 at 15:25
@cichy string has no Normalize method !? – onmyway133 Nov 07 '12 at 02:48
@entropy Yes it does. :| Link to the Mono docs on it. http://goo.gl/HnBcJK – Dan Atkinson Aug 08 '13 at 18:20
Good solution, also works great with Arabic diacritics and other special characters. – Tamim Al Manaseer Jun 24 '14 at 14:46
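The question in the comments above about SpacingCombiningMark and EnclosingMark can be made concrete with a small sketch. It compares a filter that drops only NonSpacingMark, as this answer does, with one that drops all three mark categories, as the accepted answer does. The `Strip` helper and the class name are hypothetical and exist only for this comparison.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class MarkCategoryDemo
{
    // Hypothetical helper: decompose, drop chars in the listed categories,
    // then recompose to NFC as this answer does.
    public static string Strip(string input, params UnicodeCategory[] dropped)
    {
        var sb = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (Array.IndexOf(dropped, CharUnicodeInfo.GetUnicodeCategory(c)) < 0)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // 'a' followed by COMBINING ENCLOSING CIRCLE, an EnclosingMark.
        string circled = "a\u20DD";

        string onlyNonSpacing = Strip(circled, UnicodeCategory.NonSpacingMark);
        string allMarks = Strip(circled,
            UnicodeCategory.NonSpacingMark,
            UnicodeCategory.SpacingCombiningMark,
            UnicodeCategory.EnclosingMark);

        Console.WriteLine(onlyNonSpacing.Length); // 2: the enclosing mark survives
        Console.WriteLine(allMarks.Length);       // 1: only 'a' remains
    }
}
```

For ordinary Latin accents the two filters give the same result, since those accents are all NonSpacingMark; the difference only shows up for marks in the other two categories.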
