18

Some of our users use e-mail clients that can't cope with Unicode, even when the encoding, etc. are properly set in the mail headers.

I'd like to 'normalise' the content they're receiving. The biggest problem we have is users copy'n'pasting content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.

I'm guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that'll get me started?

There are basically three phases involved.

First, stripping accents from otherwise-normal letters - solution to this is here

This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions

goes to

This paragraph contains “smart quotes” and accents and ½ of the problem is fractions

Second, replacing single Unicode characters with their ASCII equivalent, to give:

This paragraph contains "smart quotes" and accents and ½ of the problem is fractions

This is the part where I'm hoping there's a solution before I implement my own.

Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.

Any ideas?

Dylan Beattie
  • Wouldn't replacing `Ä` by `Ae` make more sense than replacing it with `A`? – CodesInChaos May 28 '11 at 18:47
  • And how do you detect that the target mail client will not understand utf-8? – CodesInChaos May 28 '11 at 18:50
  • @CodeInChaos - we detect non-Unicode-compatible mail clients by waiting for the users to telephone us and complain that our software is e-mailing them garbage. This happens quite a lot. We cannot reproduce the fault; it seems to affect Outlook 2003 but our test VM with Outlook 2003 displays unicode just fine. – Dylan Beattie May 28 '11 at 19:14
  • @CodeInChaos - re: Ä --> Ae - exactly. Also, replacing the Unicode ellipsis with "...", ° with " degrees", © with (C) and many other replacements that preserve the meaning of the original Unicode text as best they can. – Dylan Beattie May 28 '11 at 19:16
  • If you're replacing Unicode quotation marks with `"`, be careful about delimiter collision attacks. – dan04 May 29 '11 at 06:25
  • @CodeInChaos - Replacing `Ä` by `Ae` makes sense for German, but not necessarily for other languages using the `Ä` character. – dan04 May 29 '11 at 06:27
  • What I would suggest is to ask the character encoding guru [Michael Kaplan](http://blogs.msdn.com/b/michkap/). IIRC I asked a question some years ago in his blog and got a great answer. Maybe he can help you on this, too. For the accent part, see [this SO posting](http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net). – Uwe Keim May 28 '11 at 18:42

4 Answers

25

Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback?" - the question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks?"

In other words - we don't need a general-purpose solution; we need a solution that'll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years' worth of messages sent through our system looking for characters that aren't representable in ASCII encoding, using this test:

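///<summary>Determine whether the supplied character is representable
///using ASCII encoding.</summary>
bool IsAscii(char inputChar) {
    // ASCIIEncoding (in System.Text) substitutes '?' for anything it cannot encode,
    // so only genuine ASCII characters survive the round trip unchanged.
    var ascii = new ASCIIEncoding();
    var asciiChar = (char)(ascii.GetBytes(inputChar.ToString())[0]);
    return(asciiChar == inputChar);
}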
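The counting step itself isn't shown in this answer. A minimal sketch of how such a tally could be gathered follows; `NonAsciiTally` and the `messages` parameter are hypothetical names, and the `c <= '\u007F'` comparison is assumed to give the same result as the `IsAscii` test above for any single `char`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NonAsciiTally {
    // Counts every character above U+007F across the supplied messages.
    // 'messages' is a hypothetical stand-in for wherever the sent mail is stored.
    public static Dictionary<char, int> Count(IEnumerable<string> messages) {
        var counts = new Dictionary<char, int>();
        foreach (var message in messages) {
            foreach (var c in message) {
                if (c <= '\u007F') continue; // plain ASCII, nothing to report
                int seen;
                counts.TryGetValue(c, out seen); // seen stays 0 for a new character
                counts[c] = seen + 1;
            }
        }
        return counts;
    }

    // Most frequent offenders first, ready for assigning replacements by hand.
    public static IEnumerable<KeyValuePair<char, int>> MostCommon(Dictionary<char, int> counts) {
        return counts.OrderByDescending(pair => pair.Value);
    }
}
```

Sorting by frequency puts the handful of characters that cause most of the complaints at the top, which is where a hand-written replacement matters most.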

I've then been through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoding approximation.

using System;
using System.Collections.Generic;
using System.Linq;

public static class StringExtensions {
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();
    /// <summary>Returns the specified string with characters not representable in 7-bit ASCII converted to a suitable representative equivalent.  Yes, this is lossy.</summary>
    /// <param name="s">A string.</param>
    /// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks 'normalized' to ASCII equivalents.</returns>
    /// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text for sending to downlevel e-mail clients.</remarks>
    public static string Asciify(this string s) {
        return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
    }

    // Looks up the replacement for a single character.
    // Characters with no entry in the table pass through unchanged.
    private static string Asciify(char x) {
        string replacement;
        return Replacements.TryGetValue(x, out replacement) ? replacement : x.ToString();
    }

    static StringExtensions() {
        Replacements['’'] = "'"; // 75151 occurrences
        Replacements['–'] = "-"; // 23018 occurrences
        Replacements['‘'] = "'"; // 9783 occurrences
        Replacements['”'] = "\""; // 6938 occurrences
        Replacements['“'] = "\""; // 6165 occurrences
        Replacements['…'] = "..."; // 5547 occurrences
        Replacements['£'] = "GBP"; // 3993 occurrences
        Replacements['•'] = "*"; // 2371 occurrences
        Replacements['\u00A0'] = " "; // non-breaking space, 1529 occurrences
        Replacements['é'] = "e"; // 878 occurrences
        Replacements['ï'] = "i"; // 328 occurrences
        Replacements['´'] = "'"; // 226 occurrences
        Replacements['—'] = "-"; // 133 occurrences
        Replacements['·'] = "*"; // 132 occurrences
        Replacements['„'] = "\""; // 102 occurrences
        Replacements['€'] = "EUR"; // 95 occurrences
        Replacements['®'] = "(R)"; // 91 occurrences
        Replacements['¹'] = "(1)"; // 80 occurrences
        Replacements['«'] = "\""; // 79 occurrences
        Replacements['è'] = "e"; // 79 occurrences
        Replacements['á'] = "a"; // 55 occurrences
        Replacements['™'] = "TM"; // 54 occurrences
        Replacements['»'] = "\""; // 52 occurrences
        Replacements['ç'] = "c"; // 52 occurrences
        Replacements['½'] = "1/2"; // 48 occurrences
        Replacements['\u00AD'] = "-"; // soft hyphen, 39 occurrences
        Replacements['°'] = " degrees "; // 33 occurrences
        Replacements['ä'] = "a"; // 33 occurrences
        Replacements['É'] = "E"; // 31 occurrences
        Replacements['‚'] = ","; // 31 occurrences
        Replacements['ü'] = "u"; // 30 occurrences
        Replacements['í'] = "i"; // 28 occurrences
        Replacements['ë'] = "e"; // 26 occurrences
        Replacements['ö'] = "o"; // 19 occurrences
        Replacements['à'] = "a"; // 19 occurrences
        Replacements['¬'] = " "; // 17 occurrences
        Replacements['ó'] = "o"; // 15 occurrences
        Replacements['â'] = "a"; // 13 occurrences
        Replacements['ñ'] = "n"; // 13 occurrences
        Replacements['ô'] = "o"; // 10 occurrences
        Replacements['¨'] = ""; // 10 occurrences
        Replacements['å'] = "a"; // 8 occurrences
        Replacements['ã'] = "a"; // 8 occurrences
        Replacements['ˆ'] = ""; // 8 occurrences
        Replacements['©'] = "(c)"; // 6 occurrences
        Replacements['Ä'] = "A"; // 6 occurrences
        Replacements['Ï'] = "I"; // 5 occurrences
        Replacements['ò'] = "o"; // 5 occurrences
        Replacements['ê'] = "e"; // 5 occurrences
        Replacements['î'] = "i"; // 5 occurrences
        Replacements['Ü'] = "U"; // 5 occurrences
        Replacements['Á'] = "A"; // 5 occurrences
        Replacements['ß'] = "ss"; // 4 occurrences
        Replacements['¾'] = "3/4"; // 4 occurrences
        Replacements['È'] = "E"; // 4 occurrences
        Replacements['¼'] = "1/4"; // 3 occurrences
        Replacements['†'] = "+"; // 3 occurrences
        Replacements['³'] = "'"; // 3 occurrences
        Replacements['²'] = "'"; // 3 occurrences
        Replacements['Ø'] = "O"; // 2 occurrences
        Replacements['¸'] = ","; // 2 occurrences
        Replacements['Ë'] = "E"; // 2 occurrences
        Replacements['ú'] = "u"; // 2 occurrences
        Replacements['Ö'] = "O"; // 2 occurrences
        Replacements['û'] = "u"; // 2 occurrences
        Replacements['Ú'] = "U"; // 2 occurrences
        Replacements['Œ'] = "Oe"; // 2 occurrences
        Replacements['º'] = "?"; // 1 occurrences
        Replacements['‰'] = "0/00"; // 1 occurrences
        Replacements['Å'] = "A"; // 1 occurrences
        Replacements['ø'] = "o"; // 1 occurrences
        Replacements['˜'] = "~"; // 1 occurrences
        Replacements['æ'] = "ae"; // 1 occurrences
        Replacements['ù'] = "u"; // 1 occurrences
        Replacements['‹'] = "<"; // 1 occurrences
        Replacements['±'] = "+/-"; // 1 occurrences
    }
}

Note that there are some rather odd fallbacks in there - like this one:

Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences

That's because one of our users has some program that converts open/close smart quotes into ² and ³ (for example: he said ²hello³), and nobody has ever used them to represent exponentiation. This will probably work quite nicely for us, but YMMV.

Dylan Beattie
6

I had some problems with this myself, whilst using a list of strings originally built in Word. I have found that a simple String.Replace(current char/string, new char/string) call works perfectly. The exact code I used for smart quotes (left ', right ', left " and right ") is as follows:

StringName = StringName.Replace(ChrW(8216), "'")     ' Replaces any left ' (U+2018) with a normal '
StringName = StringName.Replace(ChrW(8217), "'")     ' Replaces any right ' (U+2019) with a normal '
StringName = StringName.Replace(ChrW(8220), """")    ' Replaces any left " (U+201C) with a normal "
StringName = StringName.Replace(ChrW(8221), """")    ' Replaces any right " (U+201D) with a normal "

I hope this helps anyone out there still having this problem!

Miguel
Paul
1

> is there some built-in method that'll get me started?

The first thing I'd try is to convert the text to NFKD normalization form, using the string's Normalize method (with NormalizationForm.FormKD). This suggestion is mentioned in the answer to the question you linked, but I recommend using NFKD instead of NFD because NFKD also removes unwanted typographical distinctions (e.g., NBSP → space, or ℂ → C).
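As a rough illustration of the NFKD suggestion, here is a minimal sketch. The class and method names (`NfkdSketch`, `Decompose`) are made up for this example, and dropping the non-spacing marks after decomposition is the accent-stripping step borrowed from the linked diacritics question rather than part of normalization itself.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NfkdSketch {
    public static string Decompose(string input) {
        // FormKD applies compatibility decomposition: an accented letter becomes
        // the base letter followed by a combining mark, and NBSP becomes a space.
        var decomposed = input.Normalize(NormalizationForm.FormKD);
        var result = new StringBuilder(decomposed.Length);
        foreach (var c in decomposed) {
            // Drop the combining marks (category Mn) left behind by the decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark) {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

The result is not guaranteed to be ASCII. Smart quotes have no compatibility decomposition and come out unchanged, and ½ decomposes to 1, U+2044 (fraction slash) and 2, so a lookup table is still needed for whatever remains.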

You might also be able to make generic replacements by Unicode category. For example, characters in category Pd (dash punctuation) can be replaced by -, those in Nd (decimal digits) by the corresponding 0-9 digit, and those in Mn (non-spacing marks) by the empty string (to remove accents).
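A minimal sketch of the category-based idea, assuming only the three categories named above are handled. `CategorySketch` and `ReplaceByCategory` are hypothetical names; everything the switch does not recognise is passed through untouched for a later lookup-table step.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class CategorySketch {
    public static string ReplaceByCategory(string input) {
        var result = new StringBuilder(input.Length);
        foreach (var c in input) {
            if (c <= '\u007F') { result.Append(c); continue; } // already ASCII
            switch (CharUnicodeInfo.GetUnicodeCategory(c)) {
                case UnicodeCategory.DashPunctuation:    // Pd: en dash, em dash, ...
                    result.Append('-');
                    break;
                case UnicodeCategory.DecimalDigitNumber: // Nd: digits from other scripts
                    result.Append((char)('0' + CharUnicodeInfo.GetDecimalDigitValue(c)));
                    break;
                case UnicodeCategory.NonSpacingMark:     // Mn: combining accents, dropped
                    break;
                default:                                 // anything else is left as-is
                    result.Append(c);
                    break;
            }
        }
        return result.ToString();
    }
}
```

Because the loop works one `char` at a time, characters outside the Basic Multilingual Plane arrive as surrogate pairs and fall through to the default branch. Running this after an NFKD decomposition lets the Mn case remove accents from precomposed letters as well.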

> but somebody might have written a suitable lookup table I can re-use.

You could try using the data from the Unidecode program, or CLDR.

Edit: There's a huge substitution chart here.

dan04
  • The linked substitution chart moved: http://unicode.org/cldr/charts/32/supplemental/character_fallback_substitutions.html – Jeremy Murray Oct 10 '17 at 16:10
-1

You should never try to convert Unicode to ASCII, because you will end up creating more problems than you solve.

It's like trying to fit 1,114,112 codepoints (Unicode 6.0) into just 128 characters.

Do you think you will succeed?

BTW, there are lots of quotation marks in Unicode, not only the ones you mentioned. If you want to do the conversion anyway, remember that the right conversions depend on the locale.

Check ICU - that contains the most complete Unicode conversion routines.

sorin
  • Our market is quite geographically and culturally specific, and our customers aren't using Unicode on purpose. We get 10-15 calls a week about "gibberish in e-mails", compared to no complaints - ever - about not being able to send e-mails in Arabic or Hebrew. So yes, it's a stupid problem, but it's real :) – Dylan Beattie May 28 '11 at 18:47
  • @Dylan there is no bad question :) BTW, can you specify more about why you cannot use Unicode? I would really like to see why and if there are any workarounds. BTW, I updated my answer to include one place to check. – sorin May 28 '11 at 19:03
  • Isn't ICU for C++ only? (i.e. not for .NET). – Uwe Keim May 28 '11 at 19:08
  • No, there is no such thing for .NET, as stated at http://blogs.msdn.com/b/michkap/archive/2008/12/18/9234330.aspx - but if you write a server application you could add C components and call them. – sorin May 28 '11 at 19:26