1

I have an ASP Classic app that allows people to copy and paste Word documents into a regular form field. I then post that document via jQuery Ajax to SQL Server, where the information is saved.

My problem is that the curly quotes and other word characters turn into strange characters when they come back out.

I'm trying to filter them on my save routines (classic asp stored procedure), but I still can't quite eliminate the problems.

The ASP pages have this header with the ISO-8859-1 charset. Characters look fine when pasted into the text input fields.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

My jQuery code builds the following JSON in the ASP Page:

var jsonToSend = { serial: serial, critiqueText: escape(critiqueText) };

The database collation is set to SQL_Latin1_General_CP1_CI_AS

I use TEXT and VARCHAR fields to hold the text (yes, I know the Text field type is not preferred, but it's what I have right now).

What must I do at each point to ensure that (1) the Word characters are stripped out, and (2) the encoding is consistent from front to back so I don't get any odd characters displaying?

Oh- ASP Classic 3 running in 32-bit mode on Windows Server 2003 against SQL Server 2005.

Oleg
  • 220,925
  • 34
  • 403
  • 798
Caveatrob
  • 12,667
  • 32
  • 107
  • 187
  • 1
    Have you tried saving this as NTEXT or NVarchar instead? You might not have to do any of this at all. – RBarryYoung Feb 06 '13 at 17:51
  • If I just change the data type on a field from VarChar to nVarChar, will that work straightaway even on existing data? – Caveatrob Feb 06 '13 at 17:52

3 Answers3

0

Quick and dirty solution would be using nvarchar and ntext in your backend database. Strange chars you mention is problem of encoding. For example see below example.

  • İiıIÜĞ in turkish language win-1254
  • İiıIÃœÄ in normal ANSI
  • C4B069C4B149C39CC49E both of them have same hex value.

You use ISO-8859-1 encoding in web page. This means that you are only able to save only ASCII characters that is only first 256 bit of full unicode. See this answer. You use Latin1 in database. Approximately this three characters sets are equal. Latin1-General = Win 1252 = IEC_8859-1.

  ISO/IEC_8859-1 is the basis for most popular 8-bit character sets, including Windows-1252 and the first block of characters in Unicode.

  SQL_Latin1_General_CP1_CI_AS:- Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, 
  width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data

This means that whatever character you entered to database first 256 bits values are safe. If you know your client's default encodings. I suggest to try this default encoding to see if you can recover some information. I gave example in Turkey, I know that most client's use Win1254 therefore I will try to change values to that encoding and see I can recover anything.

Second part of your answer is that you can safely change from varchar to nvarchar without loss of information. Here this without loss of information would be first part hex value (first 256 value). Your strange chars would remain but other characters stays.

This answer and linked article gives more information.

Community
  • 1
  • 1
Atilla Ozgur
  • 14,339
  • 3
  • 49
  • 69
  • What are the implications of just changing the field types and copying the existing char and varchar data? – Caveatrob Feb 02 '13 at 04:13
  • 1
    Browsers treat `"ISO-8859-1"` as a synonym for Windows-1252, and this is even in the specification draft for HTML5. So does SQL Server, apparently. It is only the *real* ISO-8859-1 that uses first 256 characters of unicode and therefore cannot encode smart quotes, euro symbol and so on. – Esailija Feb 03 '13 at 10:41
0

You should not use the javascript function escape, it uses non-standard encoding that is a mix of standard URL encoding using ISO-8859-1 and a weird %uxxxx scheme for anything not in ISO-8859-1. Additionally, you should not manually escape anything at all, since jQuery will use proper escaping on your jsonToSend-object anyway.

So when you do this:

var jsonToSend= { serial: serial, critiqueText: escape(critiqueText) } ;
$.post( "example.asp", jsonToSend );

And critiqueText is, say, “hello world”. First the escape will turn it into:

%u201Chello%20world%u201D

Then jQuery will apply standard URL encoding on that before sending and it will become:

%25u201Chello%2520world%25u201D

So simply change your jsonToSend to:

var jsonToSend= { serial: serial, critiqueText: critiqueText) } ;

Which results in

%E2%80%9Chello%20world%E2%80%9D

I.E. standard URL encoding, you can even point your browser to http://en.wikipedia.org/wiki/%E2%80%9Chello%20world%E2%80%9D


Note, it's likely that Classic ASP won't recognize standard URL encoding, so here's a function to apply Win1252 URL encoding:
var map = {
    0x20AC: 128,
    0x201A: 130,
    0x0192: 131,
    0x201E: 132,
    0x2026: 133,
    0x2020: 134,
    0x2021: 135,
    0x02C6: 136,
    0x2030: 137,
    0x0160: 138,
    0x2039: 139,
    0x0152: 140,
    0x017D: 142,
    0x2018: 145,
    0x2019: 146,
    0x201C: 147,
    0x201D: 148,
    0x2022: 149,
    0x2013: 150,
    0x2014: 151,
    0x02DC: 152,
    0x2122: 153,
    0x0161: 154,
    0x203A: 155,
    0x0153: 156,
    0x017E: 158,
    0x0178: 159
};

function urlEncodeWin1252( str ) {
    return escape( str.replace( /[\d\D]/g, function(m){
        var cc = m.charCodeAt(0);
        if( cc in map ) {
            return String.fromCharCode(map[cc]);
        }
        return m;
    }));
}

You still can't have jQuery double encoding the result from this, so pass it a plain string:

var jsonToSend= "serial=" + serial + "&critiqueText="  urlEncodeWin1252(critiqueText);

Which will result in:

serial=123&critiqueText=%93hello%20world%94

You might want to rename that variable, there is no JSON anywhere.

Esailija
  • 138,174
  • 23
  • 272
  • 326
0

I deal with importing of crazy characters into SQL all day long and nvarchar is the way to go. Unless they're numbers or something of that sort I set the columns to nvarchar(max) so I won't have to deal with it. The only exception you have to keep in mind is if you're going to use Foreign Keys then you'll have to set it to nvarchar(450). This handles all kinds of crazy characters, spacing, and gaps in text as the result of tabs.

Techie Joe
  • 847
  • 2
  • 14
  • 32