I want to automatically convert UTF-8 characters like â Ù á Č Ģ to a U a C G so that they would be acceptable in a URL.

So far I have this:

Encoding sourceEncoding = Encoding.GetEncoding(28591); // ISO-8859-1

byte[] asciiBytes = Encoding.Convert(sourceEncoding, Encoding.ASCII, sourceEncoding.GetBytes(<source text>));

String asciiString = Encoding.UTF8.GetString(asciiBytes);

Two problems with this approach:

  1. This works fine for some characters (Č and Ģ), but for others (â, Ù, á) it returns a question mark in place of the character.
  2. The whole site is in UTF-8, not ISO-8859-1, but when I set sourceEncoding to Encoding.UTF8 all of the characters are converted to question marks, so it doesn't work at all.

Got any ideas how I could make this work?

KRTac
  • Is there a reason you are not simply [url-encoding](http://stackoverflow.com/questions/575440/url-encoding-using-c-sharp) these characters? – Jon Mar 05 '12 at 10:57
  • If you really want to remove diacritics, look at [this answer](http://stackoverflow.com/a/3769995/41071). – svick Mar 05 '12 at 11:21
  • This was already answered [here](http://stackoverflow.com/questions/497782/how-to-convert-a-string-from-utf8-to-ascii-single-byte-in-c). Make sure that your characters can be displayed in ASCII. Hope it answers your question. – David Rasuli Mar 05 '12 at 11:11
  • @Jon: I want to keep URL encoded characters in URLs to a minimum. I want all the characters that have ASCII equivalents to be displayed as their ASCII equivalents (Ģ->G, â->a, ...). – KRTac Mar 05 '12 at 12:32
  • @svick: This method doesn't always give ASCII characters. – KRTac Mar 05 '12 at 12:54
  • @David Rasuli: I would need an answer for ASP.Net 2. – KRTac Mar 05 '12 at 12:55
  • possible duplicate of http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net – Ruben Bartelink May 15 '13 at 08:13

1 Answer

The best way to remove diacritic marks (often called accent marks: tilde, cedilla, umlaut and friends) is Unicode normalization.

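The following method should remove about 99% of all diacritic marks. It decomposes the string, drops the combining marks, and then round-trips the result through code page 850. Characters in the remaining percent that the code page cannot represent will still be displayed as ?. If you don't want to see the ? characters, replace them with an empty string after calling this method.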

using System;
using System.Globalization;
using System.Text;

public static string RemoveDiacritics(string value)
{
    if (String.IsNullOrEmpty(value))
        return value;

    // FormD splits each accented letter into a base letter plus combining marks.
    string normalized = value.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder(normalized.Length);

    // Keep everything except the combining (non-spacing) marks.
    foreach (char c in normalized)
    {
        if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }

    // Round-trip through code page 850; anything it cannot represent
    // is replaced, typically with '?'.
    Encoding nonunicode = Encoding.GetEncoding(850);
    byte[] nonunicodeBytes = nonunicode.GetBytes(sb.ToString());

    return nonunicode.GetString(nonunicodeBytes);
}
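
If the result has to be strictly ASCII for a URL, one possible variant is to skip the code page round trip and simply drop whatever is still outside the ASCII range after the marks are stripped. The sketch below is only an illustration of that idea: the class name SlugSketch and the method name ToAsciiOnly are made up for this example, and silently dropping unmapped characters (instead of showing ?) is an assumption about what you want in a URL.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SlugSketch
{
    // Hypothetical helper: decompose, drop combining marks, keep only ASCII.
    public static string ToAsciiOnly(string value)
    {
        if (String.IsNullOrEmpty(value))
            return value;

        // FormD turns e.g. U+00E2 into 'a' followed by a combining circumflex.
        string decomposed = value.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Skip the combining marks that FormD split off.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Assumption: anything still outside ASCII is dropped rather than shown as '?'.
            if (c <= 0x7F)
                sb.Append(c);
        }

        return sb.ToString();
    }
}
```

With the characters from the question, SlugSketch.ToAsciiOnly("âÙáČĢ") should give aUaCG. Letters that have no decomposition (ø or ß, for example) are removed entirely by this variant, so you may want to map those by hand before calling it.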

Hope that helps!

Martin Buberl