SQL Server 2008 and Unicode Character comparison

Question

It seems that SQL Server 2008 removes some unicode characters when comparing two strings. Consider the following table:

CREATE TABLE [dbo].[Test](
[text] [nvarchar](50) NOT NULL,
  CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED 
  (
    [text] ASC
  ))

And now if I insert some rows with unicode characters:

insert into Test values(N'it᧠')
insert into Test values(N'it')

I get a unique constraint exception, even though the values are different. I'm using the default database collation here, which is SQL_Latin1_General_CP1_CI_AS.

Violation of PRIMARY KEY constraint 'PK_Test'. Cannot insert duplicate key in object 'dbo.Test'.

Note, that this doesn't happen for all unicode characters, but only for some characters, but I haven't been able to identify which unicode ranges exactly are problematic. For instance, the dingbat 0x2757 (❗) is removed in comparison, but 0x2764 (♥) is not. I guess it has something to do with 0x2757 being from a newer unicode standard.

So the question is, is there any way to make SQL Server 2008 work with these characters, or alternatively, can I programatically detect them (in C#, via unicode ranges or some such) and remove them beforehand?

Check this out http://stackoverflow.com/questions/4144767/unicode-characters-in-sql-table — Bayeni, Oct 28 '14 at 08:16
@Bayeni, doesn't really help. I am using nvarchar, and the N prefix. The unicode values appear in the table correctly, the problem is with comparisons. — Doron Yaacoby, Oct 28 '14 at 08:22
I am not 100% sure on why certain characters are not compared, or how to identify these, however you can work around this using a binary collation, e.g. the following works fine - `CREATE TABLE #T (A NVARCHAR(50) COLLATE Latin1_General_BIN PRIMARY KEY); INSERT #T (A) VALUES (N'it᧠'); INSERT #T (A) VALUES (N'it');` — GarethD, Oct 28 '14 at 08:28
@GarethD, thanks, that's an interesting idea. Any disadvantage I should be aware of when using binary collation? I don't have a problem with it being case sensitive, the field will only have lower case values. — Doron Yaacoby, Oct 28 '14 at 08:37
I don't think so, but if you are concerned you can switch to a different collation that is not binary, e.g. `Latin1_General_100_CI_AS`. The answer I have given is not the best, and doesn't really address the actual question, but I thought the SQL to generate the potential collations was useful and it didn't fit in a comment. — GarethD, Oct 28 '14 at 08:52

score 1 · Accepted Answer · answered Oct 28 '14 at 08:50

Okay, so a bit more digging shows this is almost certainly due to newer character, since this also works with the sql server 2008 equivalents of latin collation, but not the older versions, i.e. works with Latin1_General_100_CI_AS, but not with Latin1_General_CI_AS. To get a full list of the collations that correctly compare these strings I used:

IF OBJECT_ID('Tempdb..#T') IS NOT NULL
    DROP TABLE #T;
IF OBJECT_ID('Tempdb..#V') IS NOT NULL
    DROP TABLE #V;

CREATE TABLE #V (A NVARCHAR(50), B NVARCHAR(50));
INSERT #V (A, B) VALUES (N'it᧠', N'it');

CREATE TABLE #T (Collation VARCHAR(500), Match BIT);

DECLARE @SQL NVARCHAR(MAX) = (SELECT N'INSERT #T (Collation, Match) 
                                        SELECT ''' + Name + ''', CASE WHEN A = B COLLATE ' + name + ' THEN 1 ELSE 0 END
                                        FROM    #V;'
                                FROM sys.fn_helpcollations()
                                FOR XML PATH(''), TYPE
                            ).value('.', 'NVARCHAR(MAX)');

EXECUTE sp_executesql @SQL;

SELECT  *
FROM    #T
WHERE   Match = 0;

SQL Server 2008 and Unicode Character comparison

1 Answers1

Linked