Is there a way to dumb down text from Unicode to ASCII?

Question

What I need is something like, for each ASCII character, a list of equivalent Unicode characters.

The problem is that programs like Microsoft Excel and Word insert non-ASCII double-quotes, single-quotes, dashes, etc. when people type into documents. I want to store this text in a database field of type "varchar", which requires single-byte characters.

For the sake of storing ASCII (single-byte) text, some of those Unicode characters could be considered equivalent to or similar enough to a particular ASCII character that replacing the Unicode character with the equivalent ASCII character would be fine.

I would like a simple function like MapToASCII, that would convert Unicode text to an ASCII equivalent, allowing me to specify a replacement character for any Unicode characters that are not similar to any ASCII character.

See also http://stackoverflow.com/questions/138449/how-to-convert-a-unicode-character-to-its-ascii-equivalent — Robert Harvey, Apr 13 '11 at 20:23
That link is irrelevant to my problem, and where did all the comments go with the links I posted? That question looks similar, but it's really asking how to ENCODE to a particular code page (hence GetEncoding.GetBytes), not how to MAP Unicode characters to equivalent ASCII characters, which really has nothing to do with encoding at all. What I'm interested in is something like the WordPress function remove_accents (http://stackoverflow.com/questions/138449/how-to-convert-a-unicode-character-to-its-ascii-equivalent/1748412#1748412). The poor guy got down-voted for what is IMO a good answer, although a bit flawed. – Triynko, Apr 13 '11 at 21:06
Now THIS is highly relevant >> http://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations — Triynko, Apr 14 '11 at 16:48

Mark Wilkins · Answer 1 · 2011-04-13T20:23:11.037

1

The Win32 API WideCharToMultiByte can be used for this conversion (Unicode to ANSI). Use CP_ACP as the first parameter. Something like that would likely be better than trying to build your own mapping function.

Edit: At the risk of sounding like I am trying to promote this as a solution against the OP's wishes, it may be worth pointing out that this API does much (all?) of what is being asked for. The goal is to map (I think) a Unicode string as much as possible to "ANSI" (where ANSI may be something of a moving target in this case). An additional requirement is to be able to specify some alternative character for those that cannot be mapped. The following example does this. It "converts" a Unicode string to char and uses an underscore (second to last parameter) for those characters that cannot be converted.

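char ac[32];   /* output buffer for the converted string */
size_t i;
/* "_" is the default character substituted for anything that cannot be mapped. */
int ret = WideCharToMultiByte( CP_ACP, 0, L"abc個חあЖdef", -1,
                               ac, sizeof( ac ), "_", NULL );
for ( i = 0; i < strlen( ac ); i++ )
  printf( "%c %02x\n", ac[i], (unsigned char)ac[i] );  /* cast avoids sign-extended hex output */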
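For anyone doing this from .NET rather than Win32, a rough analogue is sketched below. It pins the target to US-ASCII instead of CP_ACP, so the result does not depend on the machine's code page. The class and method names are made up for illustration; only the System.Text calls are real.

```csharp
using System.Text;

public static class AsciiReplaceDemo
{
    // Hypothetical helper: encodes to US-ASCII and substitutes `replacement`
    // for every character that has no ASCII encoding.
    // `replacement` must itself consist of ASCII characters.
    public static string ToAsciiWithReplacement(string text, string replacement)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback(replacement));
        return ascii.GetString(ascii.GetBytes(text));
    }
}
```

Note that this only replaces; it does not approximate. A curly quote comes out as the replacement string, not as a straight quote.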

edited Apr 13 '11 at 20:23

answered Apr 13 '11 at 14:50

Mark Wilkins


Looks like a fun way to shoot yourself in the foot: "`CP_ACP`: The system default Windows ANSI code page. Note: This value can be different on different computers, even on the same network. It can be changed on the same computer, leading to stored data becoming irrecoverably corrupted. This value is only intended for temporary use and permanent storage should use UTF-16 or UTF-8 if possible." But well, looks like the OP likes the pain of mutually incompatible charsets, and your answer correctly lets him do that, so +1 from me. – Piskvor left the building Apr 13 '11 at 15:09
@Piskvor: Indeed! The results of using this function will only be satisfactory about 5% of the time (and satisfactory is a relative term ;) But without knowing more about the OP's requirements, it's hard to say; it may actually work okay. – Mark Wilkins Apr 13 '11 at 15:25
This is not a solution at all. I'm trying to map Unicode characters to their ASCII equivalent, IF ONE EXISTS. Any Unicode characters that are unlike any ASCII character will be discarded and replaced by a specified dummy character. Lest anyone jump on my case, when I say "equivalent" and "unlike", I realize this is a judgement call, but I provided a link to a table as a reference for what I'm looking for: http://www.unicode.org/charts/normalization/chart_OtherPunctuation.html – Triynko Apr 13 '11 at 18:51
Just throw away any character that you don't like. That is as sensible as what you are trying. Read: NOT! And yes, I actually do know the way to do this, but I am not about to tell somebody how to blow up their data. – tchrist Apr 13 '11 at 19:12
@Piskvor: "looks like the OP likes the pain of mutually incompatible charsets". Your unwarranted pretension and upvote of a question you clearly despise demonstrates that you're not taking this seriously. FYI, I have custom, reflection-guided, Regex-constrained string data classes, with auto-generated/minimized/optimized MSIL functions deployed as CLR Assemblies to MSSQL, auto-wrapped in check constraints, and applied to the fields originally specified in the reflected C# attribute that caused their creation, ensuring UNIFIED/IDENTICAL constraints across my application code AND database. – Triynko Apr 13 '11 at 19:12
@tchrist: huh? Right, throwing out chars I don't like is NOT as sensible as replacing curly quotes with an ASCII quote character, for example, since they're basically the same thing. That's all I'm trying to do. – Triynko Apr 13 '11 at 19:18
@Triynko: I don't have anything personal against you or the question (and I'm completely serious), I've just seen enough apps coded by people thinking "unicode schmunicode, ASCII should be enough for everybody" to recognize the most common byproduct thereof (caused by the dev beating hir head against the desk). Your question (without your later comments) originally seemed to be of this variety. I apologize for having underestimated your intelligence and/or experience. That said, could you just make a map of all "interesting" Unicode characters to their ASCII near-equivalents? – Piskvor left the building Apr 13 '11 at 19:47
No worries. I completely understand, because I'd like to go back and smack myself for implementing "varchar" instead of "nvarchar", although my DB is smaller and faster as a result, in terms of page caching and storing/transferring backups. Anyway, yeah, that's basically what I want to do. I just threw the question out here to avoid reinventing the wheel, hoping someone could post some working code or a link to such "interesting" characters. Looks like the comments on my question were deleted by a high-ranking member of Stack Overflow and replaced with a link to an irrelevant question :( – Triynko Apr 13 '11 at 21:01

score 0 · Answer 2 · edited May 23 '17 at 12:03

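A highly relevant question is here: Replacing unicode punctuation with ASCII approximations (http://stackoverflow.com/questions/4808967/replacing-unicode-punctuation-with-ascii-approximations)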

Although the answer there is insufficient, it gave me an idea. I could map each of the Unicode code points in the Basic Multilingual Plane (plane 0) to an equivalent ASCII character, if one exists. The following C# code will help by creating an HTML form in which you can type a replacement character for each code point.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Globalization;
using System.IO;

namespace UnicodeCharacterCategorizer
{
    class Program
    {
        static void Main(string[] args)
        {
            string output_filename = args.Length > 0 ? args[0] : "output.htm"; //use the command-line argument if given, otherwise this default filename
            Dictionary<UnicodeCategory,List<char>> category_character_sets = new Dictionary<UnicodeCategory,List<char>>();
            foreach (UnicodeCategory c in Enum.GetValues(typeof(UnicodeCategory)))
                category_character_sets.Add( c, new List<char>() );
            for (int i = 0; i <= 0xFFFF; i++)
            {
                if (i >= 0xD800 && i <= 0xDFFF) continue; //Skip ranges reserved for high/low surrogate pairs.
                char c = (char)i;
                UnicodeCategory category = char.GetUnicodeCategory( c );
                category_character_sets[category].Add( c );
            }
            StringBuilder file_data = new StringBuilder( @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html xmlns=""http://www.w3.org/1999/xhtml""><head><title>Unicode Category Character Sets</title><style>.categoryblock{border:3px solid black;margin-bottom:10px;padding:5px;} .characterblock{display:inline-block;border:1px solid grey;padding:5px;margin-right:5px;} .character{display:inline-block;font-weight:bold;background-color:#ffeeee} .numericvalue{color:blue;}</style></head><body><form id=""charactermap"">" );
            foreach (KeyValuePair<UnicodeCategory,List<char>> entry in category_character_sets)
            {
                file_data.Append( @"<div class=""categoryblock""><h1>" + entry.Key.ToString() + ":</h1><br />" );
                foreach (char c in entry.Value)
                {
                    string hex_value = ((int)c).ToString( "x" );
                    file_data.Append( @"<div class=""characterblock""><span class=""character"">&#x" + hex_value + @";</span><br /><span class=""numericvalue"">" + hex_value + @"</span><br /><input type=""text"" name=""r_" + hex_value + @""" /></div>" ); //close the character span before the line break
                }
                file_data.Append( "</div>" );
            }
            file_data.Append("</form></body></html>" );
            File.WriteAllText( output_filename, file_data.ToString(), Encoding.Unicode );
        }
    }
}

Specifically, that code will generate an HTML form containing all characters in the BMP, along with input text boxes named after the hex values prefixed with "r_" (r is for "replacement value"). If this were ported over to an ASP.NET page, additional code could be written to pre-populate replacement values as much as possible:

with their own value if already ASCII, or
with their Unicode-normalized FormD or FormKD decomposed equivalents, or
with a single ASCII value for an entire category (e.g. all "punctuation initial" characters with an ASCII double quote)
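A minimal sketch of how those three pre-population rules might look, tried in that order. The class name, the method name and the category-to-character choices are assumptions for illustration, and every suggestion would still need a manual review.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class ReplacementSuggester
{
    // Hypothetical helper: returns a suggested ASCII replacement for c,
    // or null when none of the three rules applies.
    public static string Suggest(char c)
    {
        // Rule 1: ASCII characters map to themselves.
        if (c <= '\u007F') return c.ToString();

        // Rule 2: decompose (FormKD) and keep whatever ASCII survives,
        // e.g. a base letter once its combining accent is dropped.
        try
        {
            var ascii = new StringBuilder();
            foreach (char d in c.ToString().Normalize(NormalizationForm.FormKD))
                if (d <= '\u007F') ascii.Append(d);
            if (ascii.Length > 0) return ascii.ToString();
        }
        catch (ArgumentException)
        {
            // Normalize rejects strings it considers invalid Unicode
            // (a lone surrogate, for instance); fall through to rule 3.
        }

        // Rule 3: one ASCII character for a whole category (assumed choices).
        switch (CharUnicodeInfo.GetUnicodeCategory(c))
        {
            case UnicodeCategory.InitialQuotePunctuation:
            case UnicodeCategory.FinalQuotePunctuation:
                return "\"";
            case UnicodeCategory.DashPunctuation:
                return "-";
            default:
                return null;
        }
    }
}
```

With these rules, Suggest('é') gives "e" and Suggest('\u2026') gives "...". The single curly quotes U+2018 and U+2019 sit in the same two quote categories as the double ones, so they would be pre-filled with a double quote and need correcting by hand.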

You could then go through manually and make adjustments, and it probably wouldn't take as long as you'd think. There are only 63,488 code points to review (the 65,536 in the BMP minus the 2,048 surrogates), and large chunks of entire categories can probably be dismissed as "not even close to anything ASCII". So, I'm going to build this map and function.
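The finished function could take roughly the following shape. This is a sketch only: the table here holds a handful of hand-picked entries standing in for the full reviewed map, and the names AsciiMapper and MapToAscii are placeholders.

```csharp
using System.Collections.Generic;
using System.Text;

public static class AsciiMapper
{
    // Tiny illustrative sample; the real table would come from the reviewed form.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u2018', "'" },  { '\u2019', "'" },   // single curly quotes
        { '\u201C', "\"" }, { '\u201D', "\"" },  // double curly quotes
        { '\u2013', "-" },  { '\u2014', "-" },   // en and em dash
        { '\u2026', "..." },                     // horizontal ellipsis
        { '\u00A0', " " },                       // no-break space
    };

    // Keeps ASCII as-is, substitutes mapped characters, and uses
    // `replacement` for everything else.
    public static string MapToAscii(string text, char replacement)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            string mapped;
            if (c <= '\u007F')
            {
                sb.Append(c);
            }
            else if (Map.TryGetValue(c, out mapped))
            {
                sb.Append(mapped);
            }
            else
            {
                // A surrogate pair is one character outside the BMP:
                // emit a single replacement and skip its second half.
                if (char.IsSurrogatePair(text, i)) i++;
                sb.Append(replacement);
            }
        }
        return sb.ToString();
    }
}
```

For example, MapToAscii("\u201CHi\u201D \u2013 caf\u00E9\u2026", '?') returns "Hi" - caf?... with straight quotes, because é is not in this sample table.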

I just discovered the DecoderFallback property of the Encoding class: see http://msdn.microsoft.com/en-us/library/system.text.decoderfallback.aspx "Best-fit fallback, which maps valid Unicode characters that cannot be decoded to an approximate equivalent. For example, a best-fit fallback handler for the ASCIIEncoding class might map Æ (U+00C6) to AE (U+0041 + U+0045). A best-fit fallback handler might also be implemented to transliterate one alphabet (such as Cyrillic) to another (such as Latin or Roman). **The .NET Framework does not provide any public best-fit fallback implementations.**" — Triynko, Apr 14 '11 at 20:16
And this is a NICE description of what I'm trying to do. Yeah, it says it's a bad idea to try to map Unicode into ASCII, but if you're going to do it anyway, this article helps you think about what you're doing - http://blogs.msdn.com/b/shawnste/archive/2006/01/19/515047.aspx — Triynko, Apr 14 '11 at 20:19
Information on implementing an Encoding Fallback Strategy can be found here - http://msdn.microsoft.com/en-us/library/ms404377(v=VS.100).aspx Since this is a string-to-bytes conversion, you just have to implement your own versions of EncoderFallback and EncoderFallbackBuffer (the Decoder* classes apply in the other direction, bytes to string). – Triynko, Apr 14 '11 at 21:06
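A rough sketch of what such a custom fallback might look like. Going from a string to ASCII bytes is the encoder direction, so the sketch derives from EncoderFallback and EncoderFallbackBuffer. The class names and the three sample table entries are invented for illustration; this is an outline under those assumptions, not a tested production implementation.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical best-fit fallback; names and table contents are illustrative.
public sealed class BestFitAsciiFallback : EncoderFallback
{
    private readonly Dictionary<char, string> map;
    private readonly string unmapped;
    private readonly int maxCharCount;

    // Every string in map, and unmapped itself, must be pure ASCII; otherwise
    // the encoder would have to fall back on the fallback output itself.
    public BestFitAsciiFallback(Dictionary<char, string> map, string unmapped)
    {
        this.map = map;
        this.unmapped = unmapped;
        maxCharCount = unmapped.Length;
        foreach (string s in map.Values)
            if (s.Length > maxCharCount) maxCharCount = s.Length;
    }

    public override int MaxCharCount { get { return maxCharCount; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new FallbackBuffer(this);
    }

    private sealed class FallbackBuffer : EncoderFallbackBuffer
    {
        private readonly BestFitAsciiFallback owner;
        private string pending = "";
        private int position;

        public FallbackBuffer(BestFitAsciiFallback owner) { this.owner = owner; }

        // Called by the encoder for each character it cannot encode.
        public override bool Fallback(char charUnknown, int index)
        {
            string mapped;
            pending = owner.map.TryGetValue(charUnknown, out mapped) ? mapped : owner.unmapped;
            position = 0;
            return pending.Length > 0;
        }

        // Surrogate pairs (characters outside the BMP) get the unmapped string.
        public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
        {
            pending = owner.unmapped;
            position = 0;
            return pending.Length > 0;
        }

        // Hands the replacement back one char at a time; '\0' means "done".
        public override char GetNextChar()
        {
            return position < pending.Length ? pending[position++] : '\0';
        }

        public override bool MovePrevious()
        {
            if (position == 0) return false;
            position--;
            return true;
        }

        public override int Remaining { get { return pending.Length - position; } }

        public override void Reset() { pending = ""; position = 0; }
    }
}

public static class BestFitDemo
{
    public static string Convert(string text)
    {
        var map = new Dictionary<char, string>
        {
            { '\u201C', "\"" }, { '\u201D', "\"" }, { '\u00C6', "AE" },
        };
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii", new BestFitAsciiFallback(map, "?"), DecoderFallback.ExceptionFallback);
        return ascii.GetString(ascii.GetBytes(text));
    }
}
```

Plugging the fallback into Encoding.GetEncoding keeps the mapping inside the normal encoding pipeline, so anything that already calls GetBytes picks it up without a separate pre-processing pass.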
