A highly relevant question is here: Replacing unicode punctuation with ASCII approximations
Although the answer there is insufficient, it gave me an idea: I could map each Unicode code point in the Basic Multilingual Plane (plane 0) to an equivalent ASCII character, if one exists. The following C# code helps by generating an HTML form in which you can type a replacement character for each code point.
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Globalization;
    using System.IO;

    namespace UnicodeCharacterCategorizer
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Use the first command-line argument as the output filename, or fall back to a default.
                string output_filename = args.Length > 0 ? args[0] : "output.htm";

                // One (initially empty) character list per Unicode category.
                Dictionary<UnicodeCategory,List<char>> category_character_sets = new Dictionary<UnicodeCategory,List<char>>();
                foreach (UnicodeCategory c in Enum.GetValues(typeof(UnicodeCategory)))
                    category_character_sets.Add( c, new List<char>() );

                // Sort every BMP code point into its category.
                for (int i = 0; i <= 0xFFFF; i++)
                {
                    if (i >= 0xD800 && i <= 0xDFFF) continue; //Skip the range reserved for high/low surrogates.
                    char c = (char)i;
                    UnicodeCategory category = char.GetUnicodeCategory( c );
                    category_character_sets[category].Add( c );
                }

                StringBuilder file_data = new StringBuilder( @"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN"" ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd""><html xmlns=""http://www.w3.org/1999/xhtml""><head><title>Unicode Category Character Sets</title><style>.categoryblock{border:3px solid black;margin-bottom:10px;padding:5px;} .characterblock{display:inline-block;border:1px solid grey;padding:5px;margin-right:5px;} .character{display:inline-block;font-weight:bold;background-color:#ffeeee} .numericvalue{color:blue;}</style></head><body><form id=""charactermap"">" );

                // One block per category, containing one small block per character.
                foreach (KeyValuePair<UnicodeCategory,List<char>> entry in category_character_sets)
                {
                    file_data.Append( @"<div class=""categoryblock""><h1>" + entry.Key.ToString() + ":</h1><br />" );
                    foreach (char c in entry.Value)
                    {
                        string hex_value = ((int)c).ToString( "x" );
                        // Each block shows the character, its hex value, and a text box named r_<hex>.
                        file_data.Append( @"<div class=""characterblock""><span class=""character"">&#x" + hex_value + @";</span><br /><span class=""numericvalue"">" + hex_value + @"</span><br /><input type=""text"" name=""r_" + hex_value + @""" /></div>" );
                    }
                    file_data.Append( "</div>" );
                }
                file_data.Append( "</form></body></html>" );
                File.WriteAllText( output_filename, file_data.ToString(), Encoding.Unicode );
            }
        }
    }
Specifically, that code generates an HTML form containing every character in the BMP, each with a text input named after its hex value prefixed with "r_" (r is for "replacement value"). If this were ported to an ASP.NET page, additional code could pre-populate as many replacement values as possible:
- with the character itself, if it is already ASCII, or
- with its Unicode-normalized FormD or FormKD decomposed equivalent, or
- with a single ASCII value for an entire category (e.g. an ASCII double quote for every "initial quote punctuation" character)
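The three pre-population rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `ReplacementSuggester`, the method `Suggest`, and the contents of the category table are hypothetical, and stripping every non-ASCII character from the FormKD decomposition is a simplification that will produce some poor suggestions (which is why a manual pass is still needed).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the original program.
public static class ReplacementSuggester
{
    // Assumed per-category fallbacks; extend or change these after reviewing the form.
    private static readonly Dictionary<UnicodeCategory, string> CategoryDefaults =
        new Dictionary<UnicodeCategory, string>
        {
            { UnicodeCategory.InitialQuotePunctuation, "\"" },
            { UnicodeCategory.FinalQuotePunctuation, "\"" },
            { UnicodeCategory.DashPunctuation, "-" },
            { UnicodeCategory.SpaceSeparator, " " }
        };

    // Returns a suggested ASCII replacement, or an empty string if none is found.
    public static string Suggest(char c)
    {
        // Rule 1: ASCII characters map to themselves.
        if (c <= 0x7F) return c.ToString();

        // Rule 2: compatibility decomposition, keeping only the ASCII parts
        // (for example, an accented letter loses its combining mark).
        string decomposed;
        try
        {
            decomposed = c.ToString().Normalize(NormalizationForm.FormKD);
        }
        catch (ArgumentException)
        {
            // Normalize rejects strings it considers invalid Unicode.
            decomposed = string.Empty;
        }

        StringBuilder ascii = new StringBuilder();
        foreach (char d in decomposed)
        {
            if (d <= 0x7F) ascii.Append(d);
        }
        if (ascii.Length > 0) return ascii.ToString();

        // Rule 3: fall back to a single value for the whole category.
        string fallback;
        return CategoryDefaults.TryGetValue(char.GetUnicodeCategory(c), out fallback)
            ? fallback
            : string.Empty;
    }
}
```

The result of `ReplacementSuggester.Suggest(c)` could then be written into the `value` attribute of the matching `r_` input when the form is generated, leaving the box empty where nothing sensible was found.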
You could then go through manually and make adjustments, and it probably wouldn't take as long as you'd think. There are only 63,488 code points to cover (65,536 in the BMP minus the 2,048 surrogates), and large chunks of entire categories can probably be dismissed as "not even close to anything ASCII". So, I'm going to build this map and function.
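As a rough idea of what the map and function might look like once the form has been filled in, here is a minimal sketch. Everything in it is an assumption rather than the finished design: the names `AsciiMapper`, `FromFormFields`, and `ToAscii` are hypothetical, empty text boxes are treated as "no replacement chosen", and unmapped characters are replaced with a "?" placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of the original program.
public static class AsciiMapper
{
    // Builds the map from submitted form fields named "r_<hex>", as generated by the form.
    public static Dictionary<char, string> FromFormFields(IEnumerable<KeyValuePair<string, string>> fields)
    {
        Dictionary<char, string> map = new Dictionary<char, string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            // Ignore unrelated fields and boxes that were left empty (assumption).
            if (!field.Key.StartsWith("r_", StringComparison.Ordinal)) continue;
            if (string.IsNullOrEmpty(field.Value)) continue;

            int code;
            bool parsed = int.TryParse(field.Key.Substring(2), NumberStyles.HexNumber,
                                       CultureInfo.InvariantCulture, out code);
            bool inBmp = code >= 0 && code <= 0xFFFF;
            bool isSurrogate = code >= 0xD800 && code <= 0xDFFF;
            if (parsed && inBmp && !isSurrogate) map[(char)code] = field.Value;
        }
        return map;
    }

    // Replaces every non-ASCII character using the map; unmapped characters become `unknown`.
    public static string ToAscii(string input, IDictionary<char, string> map, string unknown = "?")
    {
        StringBuilder output = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            string replacement;
            if (c <= 0x7F)
            {
                output.Append(c);
            }
            else if (map.TryGetValue(c, out replacement))
            {
                output.Append(replacement);
            }
            else
            {
                // A surrogate pair is a single character outside the BMP,
                // so consume both halves and emit one placeholder.
                if (char.IsHighSurrogate(c) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
                {
                    i++;
                }
                output.Append(unknown);
            }
        }
        return output.ToString();
    }
}
```

Keeping the map as plain data means it can be corrected later without touching the conversion function.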