
How do I convert Unicode characters to their closest ASCII equivalents, for example Ä -> A? I googled but didn't find a suitable solution. The trick Encoding.ASCII.GetBytes("Ä")[0] didn't work (the result was ?).

I found that there is an Encoder class with a Fallback property that exists exactly for the case where a character can't be converted, but the built-in implementation (EncoderReplacementFallback) doesn't help here: it just converts to ?.

Any ideas?

Peter Mortensen
Andrey
  • How are you defining 'closest ascii equivalents'? Just removing accent marks? – mmr Apr 12 '10 at 19:12
  • Well... if there is ANY definition other than converting to ?, I would like it :) I think in the case of accents it is just removing them. – Andrey Apr 12 '10 at 19:15
  • @mmr: I don't know about Andrey's intended usage, but I have a program which is supposed to accept text and give it to an attached device which will later display it. Many kinds of attached devices can't afford the RAM required to store multi-byte characters, and couldn't afford the ROM needed to store more than 256 character glyphs anyway [the majority of character-matrix LCD modules have a fixed 160-character set plus the ability to show eight simultaneous custom 5x7 or 5x8 glyphs]. Rendering all non-ASCII text as "?" would seem needlessly ugly. – supercat Jan 15 '14 at 19:24
  • @supercat My reason was that we were sending international SMS and couldn't be sure that the symbols were present in the SMS encoding (I don't remember the details), so we agreed that it is better to strip out diacritics completely than to show question marks. Essentially it is very similar to your case. – Andrey Jan 15 '14 at 23:44

2 Answers


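If it is just a matter of removing the diacritical marks, then head to this answer. It decomposes the string (Normalization Form D), drops the combining marks, and recomposes what is left: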

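using System.Globalization;
using System.Text;

static string RemoveDiacritics(string stIn) {
  // Decompose accented characters into base character + combining marks (e.g. Ä -> A + U+0308).
  string stFormD = stIn.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  for (int ich = 0; ich < stFormD.Length; ich++) {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
    // Keep everything except the combining (non-spacing) marks.
    if (uc != UnicodeCategory.NonSpacingMark) {
      sb.Append(stFormD[ich]);
    }
  }

  // Recompose whatever is left.
  return sb.ToString().Normalize(NormalizationForm.FormC);
}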
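Note that this only helps for characters that have a canonical decomposition. Letters such as Ø or ß do not decompose and pass through unchanged, so they would still become ? when encoded as ASCII. A minimal sketch of how the two steps might be combined is below; the class name `AsciiFolding` and the helper `ToAscii` are made up for illustration, and `RemoveDiacritics` is repeated only so the block is self-contained.

```csharp
using System;
using System.Globalization;
using System.Text;

static class AsciiFolding
{
    // Same technique as the method above: decompose, drop combining marks, recompose.
    public static string RemoveDiacritics(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char ch in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                sb.Append(ch);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    // Hypothetical helper: strip the marks first, then let the ASCII encoder's
    // default replacement fallback turn anything still non-ASCII into '?'.
    public static string ToAscii(string input)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(RemoveDiacritics(input));
        return Encoding.ASCII.GetString(bytes);
    }
}
```

With this sketch, `AsciiFolding.ToAscii("Ärger")` should give `Arger`, while the Ø in `Ørsted` would still come out as `?`, which is where a hand-written mapping becomes necessary.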
BalusC

MS Dynamics has a problem where it won't accept any character outside the range 0x20 to 0x7F, and some characters within that range are invalid as well. My answer was to create an array, indexed by the code of the invalid character, that returns the best guess at a valid replacement.
It ain't pretty, but it works.

Function PlainAscii(InText)
Dim i, c, a
' Despite its name, this pattern matches strings made up entirely of printable ASCII (0x20-0x7E)
Const cUTF7 = "^[\x20-\x7e]+$"
Const IgnoreCase = False
    PlainAscii = ""
    If InText = "" Then Exit Function
    If RegExTest(InText, cUTF7, IgnoreCase) Then
        PlainAscii = InText
    Else
        For i = 1 To Len(InText)
            c = Mid(InText, i, 1)
            a = Asc(c)
            If a = 10 Or a = 13 Or a = 9 Then
                ' Do Nothing - Allow LF, CR & TAB
            ElseIf a < 32 Then
                c = " "
            ElseIf a > 126 Then
                c = CvtToAscii(a)
            End If
            PlainAscii = PlainAscii & c
        Next
    End If
End Function

Function CvtToAscii(inChar)
' Maps The Characters With The 8th Bit Set To 7 Bit Characters
Dim arrChars
    arrChars = Array(" ", " ", "$", " ", ",", "f", """", " ", "t", "t", "^", "%", "S", "<", "O", " ", "Z", " ", " ", "'", "'", """", """", ".", "-", "-", "~", "T", "S", ">", "o", " ", "Z", "Y", " ", "!", "$", "$", "o", "$", "|", "S", " ", "c", " ", " ", " ", "_", "R", "_", ".", " ", " ", " ", " ", "u", "P", ".", ",", "i", " ", " ", " ", " ", " ", " ", "A", "A", "A", "A", "A", "A", "A", "C", "E", "E", "E", "E", "I", "I", "I", "I", "D", "N", "O", "O", "O", "O", "O", "X", "O", "U", "U", "U", "U", "Y", "b", "B", "a", "a", "a", "a", "a", "a", "a", "c", "e", "e", "e", "e", "i", "i", "i", "i", "o", "n", "o", "o", "o", "o", "o", "/", "O", "u", "u", "u", "u", "y", "p", "y")
    ' The first entry of arrChars corresponds to character code 126 (e.g. entry 66 is À, code 192)
    CvtToAscii = arrChars(inChar - 126)
End Function

Function RegExTest(ByVal strStringToSearch, strExpression, IgnoreCase)
Dim objRegEx
    On Error Resume Next
    Err.Clear
    strStringToSearch = Replace(Replace(strStringToSearch, vbCr, ""), vbLf, "")
    RegExTest = False
    Set objRegEx = New RegExp
    With objRegEx
        .Pattern = strExpression    '//the reg expression that should be searched for
        If Err.Number = 0 Then
            .IgnoreCase = CBool(IgnoreCase)    '//case sensitivity is chosen by the caller
            .Global = True              '//match all instances of pattern
            RegExTest = .Test(strStringToSearch)
        End If
    End With
    Set objRegEx = Nothing
    On Error Goto 0
End Function

Your mapping is necessarily going to be different, depending on which characters your target system accepts.
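Since the question is about .NET, the same lookup-table idea might look roughly like this in C#. This is only a sketch: the class name `AsciiTable`, the method `Fold`, and every entry in the table are illustrative choices, not a complete or authoritative mapping.

```csharp
using System.Collections.Generic;
using System.Text;

static class AsciiTable
{
    // Illustrative entries only; extend or change them to suit the target system.
    static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        ['Ø'] = "O", ['ø'] = "o", ['Æ'] = "AE", ['æ'] = "ae",
        ['ß'] = "ss", ['Ð'] = "D", ['ð'] = "d",
        ['\u2018'] = "'", ['\u2019'] = "'",    // curly single quotes
        ['\u201C'] = "\"", ['\u201D'] = "\"",  // curly double quotes
        ['\u2013'] = "-", ['\u2014'] = "-",    // en and em dash
    };

    public static string Fold(string input, char fallback = '?')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char ch in input)
        {
            if (ch == '\t' || ch == '\n' || ch == '\r' || (ch >= ' ' && ch <= '~'))
                sb.Append(ch);            // printable ASCII, TAB, LF and CR pass through
            else if (ch < ' ' || ch == '\u007F')
                sb.Append(' ');           // other control characters become spaces
            else if (Map.TryGetValue(ch, out string mapped))
                sb.Append(mapped);        // best-guess replacement from the table
            else
                sb.Append(fallback);      // nothing suitable in the table
        }
        return sb.ToString();
    }
}
```

A dictionary keyed by character avoids the index arithmetic of a positional array and allows multi-character replacements such as ß -> ss. It could be run after a diacritic-stripping pass so that the table only needs entries for characters that do not decompose.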

Dave