I ran into a case SQL server can store "sofia" and "sofia" as two different strings, but when compared in TSQL, they are the same whatever the COLLATE is used even if the binary Collate:
CREATE TABLE #R (NAME NvarchAR(255) COLLATE SQL_Latin1_General_CP1_CI_AS)
INSERT INTO #R VALUES (N'sofia')
INSERT INTO #r VALUES (N'sofia')
SELECT * FROM #r WHERE NAME = N'sofia'
sofia
sofia
(2 row(s) affected)
IF 'sofia' = 'sofia' COLLATE SQL_Latin1_General_CP1_CI_AS
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
IF 'sofia' = 'sofia' COLLATE SQL_Latin1_General_CP437_BIN
SELECT 'Values are the same'
ELSE
SELECT 'Values are different'
-------------------
Values are the same
(1 row(s) affected)
I tried to find out the encode of "sofia"
http://stackoverflow.com/questions/1025332/determine-a-strings-encoding-in-c-sharp
It said:
// If all else fails, the encoding is probably (though certainly not
// definitely) the user's local codepage! One might present to the user a
// list of alternative encodings as shown here: http://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language
// A full list can be found using Encoding.GetEncodings();
I iterate through all the encoding returned from Encoding.GetEncodings(), none of them match
Looking into the binary I found an interesting fact: “sofia” itself is encoded with UTF16, but it can be generated from "SOFIA" UTF16 by filling “1” instead of “0” in the extra byte besides ASCII code (Ex for ‘S’: 83 255 vs 83 0) It is shown as lower case. In C#,
“sofia”
[0] 83 byte
[1] 255 byte
[2] 79 byte
[3] 255 byte
[4] 70 byte
[5] 255 byte
[6] 73 byte
[7] 255 byte
[8] 65 byte
[9] 255 byte
"SOFIA"
[0] 83 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 73 byte
[7] 0 byte
[8] 65 byte
[9] 0 byte
"sofia"
[0] 115 byte
[1] 0 byte
[2] 79 byte
[3] 0 byte
[4] 70 byte
[5] 0 byte
[6] 105 byte
[7] 0 byte
[8] 97 byte
[9] 0 byte
One can create two different directorie/files with name as C:\sofia\, C:\sofia\ or sofia.txt, sofia.txt.
Why does the SQL engine think they are the same while storing them with the original streams?
In order to get just the exact I want I had to convert to binary first:
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)
SELECT * FROM #r WHERE CONVERT(VARBINARY(100), Name) = CONVERT(VARBINARY(100), N'sofia')
sofia
(1 row(s) affected)
But this has many side effects, such as culture and case. How can I teach TSQL Engine to know they are different without much cost?
Is there an official name of this kind of string encoding ?