1

What I am working with:
Within my Asp.net Webforms application, I am getting form data from the user and then inserting that data into a SQL Server database. Each key is the identifier for the field from within the form, and the value is the data received by the user.

My Issue:
My issue is that users are copying and pasting UTF-8 data from emails, etc into the "notes" field. The SQL Server database does not recognize UTF-8 as valid character data. Instead, it utilizes both the the UCS-2 & ISO-8859-1 character sets. Thus, these character sets are being inserted into the database as question marks (?). So, I would like to properly convert any UTF-8 characters to UCS-2 or ISO-8859-1.

Questions:

  1. Should I convert the UTF-8 characters to UCS-2 or to ISO-8859-1?
  2. Within the ASP.NET web form, what is the best means of determining the character sets used within the value for the "notes" key of my hashtable?
  3. What is the best possible means for converting the characters that are UTF-8 into the acceptable character set?
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Spiffai
  • 11
  • 3
  • Are you looking for a way to convert that to ascii so that you can save it to a text field, or are you looking to save the data "as pasted"? – bri Jan 26 '16 at 23:12

1 Answers1

0

Option 1: use nvarchar

You could just change your field from varchar to nvarchar so that your unicode characters are stored correctly. That's the point of that nvarchar data type. It's cool. Use it.

Option 2: Convert Intelligently.

If you have a legacy db where nvarchar simply wont work, then you can just create a string extension that lets you store the ascii version of your values from users. Below is one such extension (note that we are doing some initial replacements for "smart" quotes/etc before ditching all characters that aren't ascii).

if you're supporting international (accents, etc), then this is a little culturally insensitive ("bah - away with your crazy accent marks and strange non-english looking letters").

public static class StringExt {
    static public string TryGetAsciiString(this string original) {

        //Replace those msword "smart" characters with ascii (dumb) characters.
        string escaped = System.Convert.ToString(p_String.Replace('\u2013', '-').Replace('\u2014', '-').Replace('\u2015', '-').Replace('\u2017', '_').Replace('\u2018', '\'').Replace('\u2019', '\'').Replace('\u201a', ',').Replace('\u201b', '\'').Replace('\u201c', '\"').Replace('\u201d', '\"').Replace('\u201e', '\"').Replace("\u2026", "...").Replace('\u2032', '\'').Replace('\u2033', '\"'));
        //regex out all those other non-ascii characters.
        escaped = Regex.Replace(p_sVal, "[^A-Za-z 0-9 \\.,\\?\'\"!@#\\$%\\^&\\*\\(\\)-_=\\+;:<>\\/\\\\\\|\\}\\{\\[\\]`~\\n\\r]*", "");
        //All set..
        return escaped;
    }
}

Option ... err... 2A? : Ditch the first 30 ascii codes (give or take)

I've noticed that, when users copy/paste from MAC word (and a few other programs), that pasted data contains characters in the first 30 ascii characters. Aside from 9, 10 and 13 ... you can probably ditch those (they're just NUL's ACK's DC's and some other garbage no user would actually type).

Community
  • 1
  • 1
bri
  • 2,932
  • 16
  • 17