
I need the kind of conversion/mapping that the CLCL clipboard manager performs, for example.

Here is what it does:

I copy the following Unicode text: ūī
And CLCL converts it to: ui

Is there a technique for doing such a conversion? Or are there mapping tables that can be used, so that, say, the symbol ū is mapped to u?

UPDATE

Thanks to all for the help. Here is what I came up with (a hybrid of two solutions), one posted by Erik Schierboom and one taken from http://blogs.infosupport.com/normalizing-unicode-strings-in-c/#comment-8984

// Requires: using System.Globalization; using System.Linq; using System.Text;
public static string ConvertUnicodeToAscii(string unicodeStr, bool skipNonConvertibleChars = false)
{
    if (string.IsNullOrWhiteSpace(unicodeStr))
    {
        return unicodeStr;
    }

    // FormD splits each accented letter into its base letter plus combining marks.
    var normalizedStr = unicodeStr.Normalize(NormalizationForm.FormD);

    if (skipNonConvertibleChars)
    {
        // Keep only ASCII: drops the combining marks and any character with no ASCII base.
        return new string(normalizedStr.Where(c => c <= 127).ToArray());
    }

    // Drop only the combining marks; characters that cannot be mapped stay as they are.
    return new string(
        normalizedStr.Where(
            c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark).ToArray());
}
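
On the mapping-table part of the question: letters such as Ł have no canonical decomposition, so normalization alone cannot turn them into L. One possible approach, sketched below, is to combine the mark-stripping step with a small hand-written table. The class name, the method name `StripMarksWithFallback`, and the table contents are all hypothetical and only cover four letters; this is an illustration, not a complete transliteration scheme.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class FallbackTableSketch
{
    // Hypothetical, deliberately tiny table for letters with no canonical decomposition.
    private static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        { '\u0141', "L" }, // Ł
        { '\u0142', "l" }, // ł
        { '\u00D8', "O" }, // Ø
        { '\u00F8', "o" }, // ø
    };

    public static string StripMarksWithFallback(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        var result = new StringBuilder(value.Length);
        foreach (char c in value.Normalize(NormalizationForm.FormD))
        {
            // Skip the combining marks produced by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
            {
                continue;
            }

            // Use the table entry if there is one, otherwise keep the character unchanged.
            result.Append(Fallbacks.TryGetValue(c, out string replacement) ? replacement : c.ToString());
        }

        return result.ToString();
    }
}
```

With this sketch, `FallbackTableSketch.StripMarksWithFallback("Łukasz")` would give "Lukasz", while characters that are neither decomposable nor in the table pass through unchanged.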
net_prog

2 Answers


I have used the following code for some time:

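// Requires: using System; using System.Linq; using System.Text;
private static string NormalizeDiacriticalCharacters(string value)
{
    if (value == null)
    {
        throw new ArgumentNullException(nameof(value));
    }

    // FormD splits each accented letter into its base letter plus combining marks.
    var normalised = value.Normalize(NormalizationForm.FormD);

    // Keep only ASCII characters; the combining marks (and anything else above 127) are dropped.
    return new string(normalised.Where(c => c <= 127).ToArray());
}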
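
To see why this works for ū but drops a letter like Ł entirely, it can help to look at what `NormalizationForm.FormD` actually produces. The following is a minimal sketch; the class and the helper `DescribeFormD` are hypothetical names introduced only for illustration.

```csharp
using System.Linq;
using System.Text;

public static class DecompositionDemo
{
    // Hypothetical helper: lists the UTF-16 code units of the FormD form as U+XXXX.
    public static string DescribeFormD(string value)
    {
        string decomposed = value.Normalize(NormalizationForm.FormD);
        return string.Join(" ", decomposed.Select(c => "U+" + ((int)c).ToString("X4")));
    }
}
```

`DecompositionDemo.DescribeFormD("ū")` gives "U+0075 U+0304", that is, a plain u followed by a combining macron, so filtering on `c <= 127` keeps the u. `DecompositionDemo.DescribeFormD("Ł")` gives "U+0141", because Ł has no canonical decomposition, so the same filter removes it completely.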
Erik Schierboom
  • I dislike the `c <= 127` hack, it’s unnecessary. But yes, that’s the gist of it. – Konrad Rudolph Mar 28 '13 at 13:59
  • Well, otherwise you could have returned a string that contains characters that fall outside the ASCII range, right? – Erik Schierboom Mar 28 '13 at 14:01
  • Look at the question I marked this one as a duplicate of. The “right” way is to look at the Unicode category and only retain non-spacing / non-combining diacritic characters. But to be honest that’s probably way less efficient and in my (admittedly limited) understanding of Unicode, your answer always yields the correct result. – Konrad Rudolph Mar 28 '13 at 14:02
  • Sorry, I missed the duplicate question part. You are right of course. – Erik Schierboom Mar 28 '13 at 14:05
  • It works, but one note, the characters which cannot be mapped, are ignored. For example, "Łukasz" becomes "ukasz". The method used in the "duplicate of" question leaves such characters in output. So, probably, it is a good idea to combine the two methods and put a bool parameter whether to leave or skip. – net_prog Mar 28 '13 at 14:32
  • Not quite what I meant, in the case when I want to leave the characters which cannot be mapped, the other characters, which can be mapped, should still be converted. So if I specify "false" to ignore, the string "Łū" must become "Łu", and if I specify "true", the output string must be "u". – net_prog Mar 28 '13 at 14:44
  • @ErikSchierboom How can I convert ASCII-encoded text to UTF-8 Unicode? How can I map the characters to Unicode in C#? Is there any tool available for such mappings? I want to do it after the user enters the content in ASCII format. Thank you – Subin Jacob Jul 12 '13 at 06:18

In general, it is not possible to convert Unicode to ASCII because ASCII is a subset of Unicode.

That being said, it is possible to convert characters within the ASCII subset of Unicode to ASCII.

In C#, generally there's no need to do the conversion, since all strings are Unicode by default anyway, and all components are Unicode-aware, but if you must do the conversion, use the following:

 string myString = "SomeString";
 byte[] asciiString = System.Text.Encoding.ASCII.GetBytes(myString);
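
Note that `Encoding.ASCII` does not map accented letters to their base letters: with its default settings, every character outside the ASCII range is replaced by a question mark. The round trip below is a small sketch of that behavior; the class and method names are hypothetical.

```csharp
using System.Text;

public static class AsciiEncodingDemo
{
    // Hypothetical helper: encodes to ASCII bytes and decodes them again,
    // so the effect of the default replacement fallback becomes visible.
    public static string RoundTripThroughAscii(string value)
    {
        byte[] asciiBytes = Encoding.ASCII.GetBytes(value);
        return Encoding.ASCII.GetString(asciiBytes);
    }
}
```

`AsciiEncodingDemo.RoundTripThroughAscii("ūī")` returns "??" rather than "ui", while a pure ASCII input such as "ui" comes back unchanged.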
SecurityMatt
  • This is not what OP meant. – Konrad Rudolph Mar 28 '13 at 13:58
  • @DavinTryon: Can you think of any ASCII characters that aren't contained in, say, UTF-8? I can think of many characters in UTF-8 that aren't in ASCII. For example the character 字 cannot be represented in US-ASCII. – SecurityMatt Mar 29 '13 at 02:08
  • Yes, but saying that it is a subset is not correct. UTF-8 (only one of the unicode formats) was explicitly created to be "backwards compatible" with ASCII. – Davin Tryon Mar 29 '13 at 21:42
  • @DavinTryon: What definition of subset are you using? Every codepoint in ASCII is contained in Unicode. ASCII is therefore completely contained within Unicode, or in other words, ASCII is a subset of Unicode. That's not to say Unicode predates ASCII, merely that it *contains* every element in ASCII (after all, that's what subset *means*). – SecurityMatt Mar 30 '13 at 13:15