2

I've stumbled across some interesting behaviour when working with some data from China. For whatever reason, they have records where a traditional open bracket (U+0028) is mixed up with a "full width" open bracket (U+FF08). Now, C# knows they are not equal by testing:

char a = Convert.ToChar(0x0028);
char b = Convert.ToChar(0xFF08);

string g = Convert.ToString(Convert.ToChar(0x0028));
string h = Convert.ToString(Convert.ToChar(0xFF08));

if (a == b) throw new Exception();
if (g == h) throw new Exception();

Excel also knows they are not equivalent, but SQL 2008 thinks they are equivalent:

--should be false and return nothing
select 1 where N'(' = N'(' 

--is false and returns nothing
select 1 where N'租' = N'(' 

--should be false and return nothing
select 1 where NCHAR(0x0028) = NCHAR(0xFF08) 

So, I think I'm missing something here, but is there a Unicode specific equality operation that would either explicitly support this equality (as opposed to logically being not equal). I'd be interested too as to how other languages treat this behaviour, as I'm using Qlik with this data, and Qlik's internal engine recognizes the two strings as being equivalent when searching, but not when loading data (yay consistency!)

I'd also guess the same behaviour exists for the closing brackets, and other punctuation, but if there's a specific SQL equality operator (or in other languages), that'd solve this problem regardless.

There is a full list of punctuation characters in Wikipedia here, and a similar issue of accented character equality


From @ZLK, the point with SQL is collation. Using code from here SQL Server 2008 and Unicode Character comparison looks like all _BIN, _BIN2 and _WS (Width-Sensitive!) collations work as expected.

David R Tribble
  • 11,918
  • 5
  • 42
  • 52
David M
  • 124
  • 5
  • 1
    It has to do with collation, e.g. if you use a BIN/BIN2 collation like Latin1_General_100_BIN2 you should get your expected results. For example, `select 1 where N'(' collate Latin1_General_100_BIN2 = N'(' collate Latin1_General_100_BIN2` will not return a result. – ZLK May 26 '17 at 00:21
  • Thanks @ZLK, I didn't think of collation. Looking around for collation-based questions I see user paul-white has answered similar questions twice in different forums: https://dba.stackexchange.com/questions/110911/latin1-general-bin-performance-impact-when-changing-the-database-default-collate https://www.sqlservercentral.com/Forums/Topic901338-391-1.aspx – David M May 26 '17 at 00:53
  • https://stackoverflow.com/questions/26603391/sql-server-2008-and-unicode-character-comparison?rq=1 as well – David M May 26 '17 at 00:54

0 Answers0