Char returns the wrong value for 29 unicode characters - Need .NET cast / convert of nchar to char

Question

Need a .NET cast / convert of the SQL nchar to char.
More specifically cast of the nchar UNICODE to the char ASCII.

Where this is complicated is SQL char uses the full byte.
Not the pure ASCII of 128.
The TSQL function ASCII returns 0-255.

Ideally there would be a NormalizationForm of FormByte.
It would not be an exact textual value - rather a close logical value or ?.
And SQL would use the FormByte to cast from nchar to char.
NormalizationForm

Encode Decode did not work for me and I tried all flavors.

In SQL many char (byte) are mapped to 63. 63 is ?.
Not just char over 255 that are mapped to 63.
130 to 140 are all mapped to 63.

Characters 160-255 all return 160-255

Over 255 not all are mapped to 63.
For example many diacritics are mapped to the ASCII.

TSQL has UNICODE and ACSII functions.
So I just loaded all Unicode characters into both char and nchar columns.

The char returned by SQL is wrong for 29 characters.
And the ASCII() returned for the bad chars does not make sense - all control chars in the 130 - 160 range.
Checked the binary for the incorrect 29 and what is is stored is what is returned by ASCII().
For 27 what is being return from char is nchar and for 2 it is not even the correct nchar. They should all be mapped to ? or a an ACSII equivalent.
“ and ” mapped to " (but will take ?) ‘ and ’ mapped to '
– (en dash) and — (em dash) mapped to -
...

I know you don't believe me.
Insert 'Œ' into a char column and select it - it will return 'Œ'.
And you can even search on it - char = 'Œ' returns true.
Select ASCII('Œ') returns 140 and that is what is actually stored (check the binary).
The UNICODE definition of 140 / 8C is Partial Line Backward.
I checked the binary value for that char and it is 8C (140).
What is returned is the unicode 'Œ' Int16 338.
It appears SQL is doing some input output mapping and getting it wrong.

ASCII function is correct for 575 unicode characters that are not mapped to ?.
The char value matches the ACSII and they all make sense.
E.G. 12 different forms of u are all mapped to u.
32163 characters other than ? are mapped to ? (63).

Below are the 29 char that return the wrong value.
Column order:
char
nchar
ASCII(char)
UNICODE(nchar)

     sqlCharASCIIbackToString did not match  Œ Œ 140 338
     sqlCharASCIIbackToString did not match  œ œ 156 339
     sqlCharASCIIbackToString did not match  Š Š 138 352
     sqlCharASCIIbackToString did not match  š š 154 353
     sqlCharASCIIbackToString did not match  Ÿ Ÿ 159 376
     sqlCharASCIIbackToString did not match  Ž Ž 142 381
     sqlCharASCIIbackToString did not match  ž ž 158 382
     sqlCharASCIIbackToString did not match  ƒ Ƒ 131 401
     sqlCharASCIIbackToString did not match  ƒ ƒ 131 402
     sqlCharASCIIbackToString did not match  ˆ ˆ 136 710
     sqlCharASCIIbackToString did not match  ˜ ˜ 152 732
     sqlCharASCIIbackToString did not match  – – 150 8211
     sqlCharASCIIbackToString did not match  — — 151 8212
     sqlCharASCIIbackToString did not match  ‘ ‘ 145 8216
     sqlCharASCIIbackToString did not match  ’ ’ 146 8217
     sqlCharASCIIbackToString did not match  ‚ ‚ 130 8218
     sqlCharASCIIbackToString did not match  “ “ 147 8220
     sqlCharASCIIbackToString did not match  ” ” 148 8221
     sqlCharASCIIbackToString did not match  „ „ 132 8222
     sqlCharASCIIbackToString did not match  † † 134 8224
     sqlCharASCIIbackToString did not match  ‡ ‡ 135 8225
     sqlCharASCIIbackToString did not match  • • 149 8226
     sqlCharASCIIbackToString did not match 
     … … 133 8230
     sqlCharASCIIbackToString did not match  ‰ ‰ 137 8240
     sqlCharASCIIbackToString did not match  ‹ ‹ 139 8249
     sqlCharASCIIbackToString did not match  › › 155 8250
     sqlCharASCIIbackToString did not match  € € 128 8364
     sqlCharASCIIbackToString did not match  ™ ™ 153 8482
     sqlCharASCIIbackToString did not match  ˜ ≈ 152 8776
     count63 =  32163 countMis =  29 countCorrect =  575

Ran the following .NET to see which 'Œ' if being returned by SQL

char char338 = (char)338;
System.Diagnostics.Debug.WriteLine(char338);
sqlCmd.CommandText = "select [char] from [charNchar] where [char] = @char;";
sqlCmd.Parameters.Add("@char", SqlDbType.Char).Value = char338;
string string338= sqlCmd.ExecuteScalar().ToString();
char338 = string338.ToCharArray()[0];
System.Diagnostics.Debug.WriteLine(char338 + " " + ((Int16)char338).ToString());

The above code returns Œ 338.
SQL is returning a value larger than byte to datatype is supposedly stored as byte.
If I search on (char)140 then ? 63 is returned.

What is interesting is a search on 'Œ' versus N'Œ' on char produce different results.
That is searching on the left (140) Œ.
Search on the right (338) Œ the char search finds nothing.
Nchar finds both results with either input.

  SELECT [int16],RTRIM([char]) as [char], ASCII([char]) as 'ASCII'
                ,RTRIM([nchar]) as [nchar], UNICODE([nchar]) as 'UNICODE'
  FROM [test].[dbo].[charNchar]
  where [char] = 'Œ'
  SELECT [int16],RTRIM([char]) as [char], ASCII([char]) as 'ASCII'
                ,RTRIM([nchar]) as [nchar], UNICODE([nchar]) as 'UNICODE'
  FROM [test].[dbo].[charNchar]
  where [char] = N'Œ'
  SELECT [int16],RTRIM([char]) as [char], ASCII([char]) as 'ASCII'
                ,RTRIM([nchar]) as [nchar], UNICODE([nchar]) as 'UNICODE'
  FROM [test].[dbo].[charNchar]
  where [nchar] = 'Œ'
  SELECT [int16],RTRIM([char]) as [char], ASCII([char]) as 'ASCII'
                ,RTRIM([nchar]) as [nchar], UNICODE([nchar]) as 'UNICODE'
  FROM [test].[dbo].[charNchar]
  where [nchar] = N'Œ'


int16  char                                               ASCII       nchar                                              UNICODE
------ -------------------------------------------------- ----------- -------------------------------------------------- -----------
338    Œ                                                  140         Œ                                                  338

int16  char                                               ASCII       nchar                                              UNICODE
------ -------------------------------------------------- ----------- -------------------------------------------------- -----------
338    Œ                                                  140         Œ                                                  338
339    œ                                                  156         œ                                                  339

int16  char                                               ASCII       nchar                                              UNICODE
------ -------------------------------------------------- ----------- -------------------------------------------------- -----------
338    Œ                                                  140         Œ                                                  338
339    œ                                                  156         œ                                                  339

int16  char                                               ASCII       nchar                                              UNICODE
------ -------------------------------------------------- ----------- -------------------------------------------------- -----------
338    Œ                                                  140         Œ                                                  338
339    œ                                                  156         œ                                                  339

≈ search finds nothing with any of the four queries. Checked the charts and that is the correct char for 8776 and is math almost equal to.

˜ is zero width pasted into SSMS but it is something as if it is pasted into the FROM the blue turn to black.

Am I missing something - this seems like a bug to me.
It is not just he wrong value it is an invalid value.
An Int16 is returned.
Lets say I wanted to use byte to store character to save space - it would break on SQL char as 29 chars are not returned as byte.

Here is the code I used:

public void SQLchar()
{

    SqlConnection sqlCon = new SqlConnection(connString);  
    try
    {         
        sqlCon.Open();
        SqlCommand sqlCmd = sqlCon.CreateCommand();
        SqlDataReader rdr;
        sqlCmd.CommandText = "delete charNchar";
        sqlCmd.ExecuteNonQuery();
        for(Int16 i = 0; i < Int16.MaxValue; i ++)
        {
            sqlCmd.CommandText = "insert into charNchar (int16,char,nchar) values (@int16, @char, @nchar);";
            sqlCmd.CommandType = System.Data.CommandType.Text;
            sqlCmd.Parameters.Clear();
            sqlCmd.Parameters.Add("@int16", SqlDbType.Int).Value = i;
            sqlCmd.Parameters.Add("@char", SqlDbType.Char).Value = (char)i;
            sqlCmd.Parameters.Add("@nchar", SqlDbType.NChar).Value = (char)i;
            sqlCmd.ExecuteNonQuery();
        }
        string sqlChar;
        string sqlNChar;
        Int16 sqlCharASCII;
        Int16 sqlNCharUnicode;
        string sqlCharASCIIbackToString;
        sqlCmd.CommandText = "select char,nchar,ASCII(char),UNICODE(nchar) from charNchar order by int16;";
        rdr = sqlCmd.ExecuteReader();
        Int16 count63 = 0;
        Int16 countMis = 0;
        Int16 countCorrect = 0;
        while (rdr.Read())
        {
            sqlChar = rdr.IsDBNull(0) ? "dbNull" : rdr.GetString(0);
            sqlNChar = rdr.IsDBNull(1) ? "dbNull" : rdr.GetString(1);
            sqlCharASCII = rdr.IsDBNull(2) ? Int16.Parse("-1") : (Int16)rdr.GetInt32(2);
            sqlNCharUnicode = rdr.IsDBNull(3) ? Int16.Parse("-1") : (Int16)rdr.GetInt32(3);
            if(sqlCharASCII == 63 && sqlNCharUnicode != 63)
            {
                count63 ++;
                continue;  // ?
            }
            if (sqlCharASCII < 0)
            {
                System.Diagnostics.Debug.WriteLine("ASCII(char) null for " + sqlChar + " " + sqlNChar);
            }
            else
            {
                sqlCharASCIIbackToString = ((char)sqlCharASCII).ToString();
                if (string.CompareOrdinal(sqlChar, sqlCharASCIIbackToString) != 0)
                {
                    countMis++;
                    System.Diagnostics.Debug.WriteLine(" sqlCharASCIIbackToString did not match " + sqlCharASCIIbackToString + " " + sqlChar + " " + sqlNChar + " " + sqlCharASCII + " " + sqlNCharUnicode);
                }
                else
                {
                    countCorrect++;
                }
            }
        }
        rdr.Close();
        System.Diagnostics.Debug.WriteLine("count63 =  " + count63.ToString() + " countMis =  " + countMis.ToString() + " countCorrect =  " + countCorrect.ToString());
    }
    catch (Exception Ex)
    {
        System.Diagnostics.Debug.WriteLine(Ex.Message);
    }
    finally 
    {
        sqlCon.Close();
    }
}

As for why.
Parse string data in .NET and that data is a FK.
Rather than round trip to SQL to get the ID for the FK use a .NET Dictionary for speed.
The Dictionary is a reverse lookup to to get the key from the value.
The parser has the Int16 of the char as that is already used by the parser.
So if the ASCII of the char is wrong then the reverse lookup fails.
I think I could hard code fixes for the incorrect ASCII results.
But I want to understand what is going on here before I go down a path that starts with patches.
Does char have some fundamental flaws?
Could just use nchar but we prefer char.
The nature of the application is we want matches.
6 diacritics of u to all match ascii u is a good thing.

@muratgu The insert that fails a unique constraint. The Dictionary comparison of Unicode to Char determined the value was not in SQL. the insert that converted the Unicode to Char determined there was duplicate. — paparazzo, Apr 13 '13 at 00:40
@muratgu And technically it was not the Dictionary comparison - it was the string comparer. And I initialize the Dictionary with a CultureInsenstive CaseInsensitive string comparer. — paparazzo, Apr 13 '13 at 00:42
In SQL Server, single-byte character sets (so-called "extended ASCII" data that encodes to decimal 0-255, which is used for datatypes char, varchar, and text) have an encoding that depend on the object's character set. Unicode data is used for nchar, nvarchar, and ntext. You should not expect single-byte character sets to encode with the same values as Unicode. They are not supposed to. The historical reason for non-Unicode single-byte character sets was so that databases could support "ASCII plus your local non-ASCII characters" before Unicode was available. — chris, Apr 16 '13 at 04:48
In other words, single-byte character sets enabled a database to define code points 128-255 in the way that was most useful locally. — chris, Apr 16 '13 at 04:55

Esailija · Accepted Answer · 2013-04-16T15:25:28.287

11

You are massively confusing code point values and encoded byte values.

The code point U+0152 (338 or Œ), is encoded in Windows-1252 as the byte 0x8C or 140 in decimal, that's what the badly named ASCII() function returns you. It is just a coincidence that many code points in Windows-1252 are encoded in a way that the code point being encoded has the same value as the encoded byte's value for that code point.

Windows-1252 can only encode:

0-127
160-255

And these that don't go neatly in a range:

338,339,352,353,376,381,382,402,
710,732,8211,8212,8216,8217,8218,
8220,8221,8222,8224,8225,8226,
8230,8240,8249,8250,8364,8482

None of the code points in the second batch will be encoded in byte value<->code point value which is what you seem to be expecting.

Windows-1252 cannot encode the range 128-159, so attempting to convert anything in that range (e.g. 130 or 140) just gets encoded as ? or 0x3F. That range is pretty much useless C1 control characters anyway.

It also doesn't utilize the full 256 character space it has, it only encodes 251 different characters. So you cannot use it as pseudo-bytes because 5 bytes are invalid Windows-1252. If that's what you tried to do, it won't work.

It is not actually clear what high level thing you are even trying to do so I'll have a guess.

If you wanted to match accent-insensitively, then just use an accent-insensitive collation. Then ü,ú, ù etc will all match u. Nothing to do with encodings.

CREATE TABLE Mytable (
    Mycolumn NVARCHAR(10) COLLATE Latin1_General_CI_AI
)

INSERT INTO Mytable (myColumn) VALUES( 'ü' ), ('ú'), ( 'ù' )

SELECT Mycolumn
FROM Mytable
WHERE Mycolumn = 'u'

--Results

MYCOLUMN
ü
ú
ù

Here's a demo http://sqlfiddle.com/#!3/67752/2.

To convert the SQLAscii to 'Œ', try this:

public static char Windows1252CPtoChar(int cp)
{
    Encoding win1252 = Encoding.GetEncoding("Windows-1252"); //this could be made static
    return win1252.GetString(new byte[] { (byte)cp })[0];
}

public static void Main(string[] args) {
    Console.WriteLine(Windows1252CPtoChar(140) == 'Œ');
}

So instead of:

sqlCharASCIIbackToString = ((char)sqlCharASCII).ToString();

Do

sqlCharASCIIbackToString = (Windows1252CPtoChar(sqlCharASCII)).ToString();

edited Apr 16 '13 at 15:25

answered Apr 16 '13 at 11:01

Esailija

138,174
23
272
326

Read the question. SQL char does NOT insert those 29 as ? - in the table I show what value is is inserted. I specifically state u is one char SQL gets right. And your answer is see it get u right for NVARCHAR. See paragraph 10 - insert 'Œ' in a char (not nchar). – paparazzo Apr 16 '13 at 14:43
@Blam but is accent insensitive what you want to do ? If you try to insert `(char)130` don't you get `?` ?. That's only what I claimed - Windows-1252 supports `338` so no `?`. – Esailija Apr 16 '13 at 14:46
@Blam Windows-1252 encodes `Œ` (338) as 0x8C (140 decimal), you can see that in http://en.wikipedia.org/wiki/Windows-1252. – Esailija Apr 16 '13 at 14:59
@Blam That depends on your collation settings. My example works just the same with VARCHAR. http://sqlfiddle.com/#!3/1699a/1 – Esailija Apr 16 '13 at 15:01
OK now I think I follow, thanks for enduring. Please see sqlCharASCIIbackToString =. I think the error is that cast. How do could I get a Windows-1252 cast from int to char? As for using nchar I want to use char so I get matches on 'ü' and 'ú' and to save space. – paparazzo Apr 16 '13 at 15:12
@Blam So you have, say, the int `140`, and you want the char `Œ` from that int? in C#? Getting matches on accented `u's` depends on the collation - you don't need to do anything but change the collation. You are using accent sensitive collation right now and all you need to do is to change that to accent insensitive one. – Esailija Apr 16 '13 at 15:16
I need to not just get CE I need the right CE. This comparison evaluates them as not the same string.CompareOrdinal(sqlChar, sqlCharASCIIbackToString). – paparazzo Apr 16 '13 at 15:20
@Blam ok I have edited code to the bottom of my answer, let me know of the results if it's not working. – Esailija Apr 16 '13 at 15:26
AWESOME! I am so happy to give the 100 points. But I am also rethinking my strategy based on you collation comment. You already to the check. But if I use nchar and want as many matches as possible then what collation would you recommend. And would a Normalize prior to inset help - if so what form? And thanks again. – paparazzo Apr 16 '13 at 15:44
@Blam If you want to remove accents from a string, use something like [this](http://stackoverflow.com/a/3769995/995876) but it removes information and is not necessary with accent-insensitive collation. I am not sure what collation works the best for you since I am not familiar with SQL Server. Basically I figured out that `AI` at the end means "accent-insensitive" and `AS` means "accent-sensitive". I used `Latin1_General_CI_AI` in my example which is case and accent insensitive. – Esailija Apr 16 '13 at 15:48
Check the comments on that link. I am already using that. Now I am going to extend my SQL char "cleanup" to include "Windows-1252". – paparazzo Apr 16 '13 at 17:17

score 2 · Answer 2 · answered Apr 13 '13 at 01:37

2

So, I'm going to assume here that what you are trying to do is exploit the fact that you are getting unicode data and storing it into a varchar field in the database... You'll want to get the input encoding as an Ascii string

string bad = Encoding.ASCII.GetString(Encoding.Unicode.GetBytes(input));

answered Apr 13 '13 at 01:37

Darren Kopp

76,581
9
79
93

+1 Let me test this out before I give it a check. If it gets me 99% of the way there I am happy. – paparazzo Apr 13 '13 at 02:42
Still +1 as it was worth a try but did not work. See update to the question. – paparazzo Apr 13 '13 at 13:16
yeah... that's a common side affect of unicode. you could likely get what you are wanting via Encoding.UTF8 vs Encoding.Unicode as UTF8 only uses 2 bytes vs 4 bytes (which is the space you are seeing) – Darren Kopp Apr 13 '13 at 18:08
[Check out the docs on Encoding.ASCII](http://msdn.microsoft.com/en-us/library/system.text.encoding.ascii.aspx) as I think you should technically be able to get it to work like you want. – Darren Kopp Apr 13 '13 at 18:11
I tried all flavors of Encode Decode. It gets it right for 0-127. char cheats and uses 0-255. – paparazzo Apr 14 '13 at 15:31
well, if there isn't anything baked in that does what you want, you can always dump the string into a `StringBuilder` class and go through each character and if it is not in the range of 0 - 255, replace that character with `?` – Darren Kopp Apr 14 '13 at 23:09
Oh no I don't want to do that. One it even makes changes under 255. And over 255 I like almost all the changes - 6 forms of U mapped to latin U and 6 forms of u mapped to latin u. I just need to be able to predict what it is going to. – paparazzo Apr 14 '13 at 23:17

score 0 · Answer 3 · edited May 23 '17 at 10:25

I'd hazard a guess you're running into conversion oddities involving the Code Page in use on either your local system or your SQL server system (this affects how the so-called "High-ASCII" decimal 128-255 characters are interpreted) and the collation in use on your column/table/database. SQL Server will try to convert 'invalid' characters to valid CHARs based on the Code Page & collation in use.

We ran into an issue way back when when a European client tried to load a name with 'ü' in it using a file set up on a system with a European Code Page active. Our SQL Server setup with code page 437 active (OEM CP setting, IIRC) stored and returned '÷'. (decimal 246) Once the code page issue was settled for CHAR data, everything was hunky-dory. I don't recall the particulars of the input & output setups, sorry.

EDIT: This Stackoverflow article talks about some of the issues pretty well, including the '?' thing.

SQL server is SQL_Latin1_General_CP1_CI_AS. 29 characters are not returning the actual binary value and not even returning a byte. Pretty sure the server is 1252. All defaults. — paparazzo, Apr 15 '13 at 20:11

Char returns the wrong value for 29 unicode characters - Need .NET cast / convert of nchar to char

3 Answers3

Linked