Why different sort order with unicode?

Question

I have this query:

select ' ' C union
select '*' C union
select '-' C
order by C

The result is space, asterisk, dash. But if I use Unicode (N-prefixed) literals like this:

select N' ' C union
select N'*' C union
select N'-' C
order by C

I get space, dash, asterisk. Can anyone explain why?

Thanks!

score 3 · Accepted Answer · edited May 23 '17 at 10:25


I thought at first that this would be down to different default collations for varchar and nvarchar, but that doesn't seem to be the case. It does seem to vary a bit by collation (I don't see it with Latin1_General_CI_AS but I do if I use SQL_Latin1_General_CP1_CI_AS).
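To check how the order varies by collation without depending on the database default, a sketch like the following could be used. It assumes SQL Server; the derived-table alias `t` is only illustrative. The wrapping is there because a `UNION` query's `ORDER BY` can only reference select-list columns, so the `COLLATE` expression has to be applied one level up.

```sql
-- Non-Unicode (varchar) literals, ordered under the SQL collation.
-- The question reports: space, asterisk, dash.
SELECT C
FROM (
    SELECT ' ' AS C UNION
    SELECT '*' UNION
    SELECT '-'
) AS t
ORDER BY C COLLATE SQL_Latin1_General_CP1_CI_AS;

-- Unicode (nvarchar) literals, same collation.
-- The question reports: space, dash, asterisk.
SELECT C
FROM (
    SELECT N' ' AS C UNION
    SELECT N'*' UNION
    SELECT N'-'
) AS t
ORDER BY C COLLATE SQL_Latin1_General_CP1_CI_AS;
```

Swapping in `Latin1_General_CI_AS` (a Windows collation) for both queries is the case where I get the same order for varchar and nvarchar.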

Looking into it further, I found this answer here on Stack Overflow, which references this article on MSDN, which has this to say about hyphens and Unicode:

A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules. Because the comparison rules for non-Unicode and Unicode data are different, when you use a SQL collation you might see different results for comparisons of the same characters, depending on the underlying data type. For example, if you are using the SQL collation "SQL_Latin1_General_CP1_CI_AS", the non-Unicode string 'a-c' is less than the string 'ab' because the hyphen ("-") is sorted as a separate character that comes before "b". However, if you convert these strings to Unicode and you perform the same comparison, the Unicode string N'a-c' is considered to be greater than N'ab' because the Unicode sorting rules use a "word sort" that ignores the hyphen.
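The comparison described in that passage can be tried directly. This is a minimal sketch, assuming SQL Server; the result column names and message strings are made up for illustration, and the outcomes noted in the comments are what the quoted passage says to expect, not something verified here.

```sql
-- varchar comparison under the SQL collation:
-- per the quoted passage, the hyphen sorts as a separate character before "b".
SELECT CASE
         WHEN 'a-c' COLLATE SQL_Latin1_General_CP1_CI_AS
            < 'ab'  COLLATE SQL_Latin1_General_CP1_CI_AS
         THEN 'a-c sorts before ab'
         ELSE 'a-c sorts after ab'
       END AS varchar_result;

-- nvarchar comparison under the same collation:
-- per the quoted passage, the Unicode "word sort" rules apply instead.
SELECT CASE
         WHEN N'a-c' COLLATE SQL_Latin1_General_CP1_CI_AS
            < N'ab'  COLLATE SQL_Latin1_General_CP1_CI_AS
         THEN 'a-c sorts before ab'
         ELSE 'a-c sorts after ab'
       END AS nvarchar_result;
```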

So I've marked this answer as CW (community wiki), as it's really just a duplicate of that answer (and the question is a duplicate of that question).


answered Aug 06 '13 at 13:14

T.J. Crowder

I observe this behavior with the same collation (do a `SELECT INTO` for each query; on a default installation both columns get `SQL_Latin1_General_CP1_CI_AS`). – Aaron Bertrand Aug 06 '13 at 13:17
@AaronBertrand: That's interesting, I don't see the behavior (I get the same order in both cases: space, dash, asterisk): http://pastie.org/8211467 For me, the columns end up being `Latin1_General_CI_AS`. – T.J. Crowder Aug 06 '13 at 13:26
@AaronBertrand: But I do if I use `SQL_Latin1_General_CP1_CI_AS`. – T.J. Crowder Aug 06 '13 at 13:29
@AaronBertrand: I found the answer...here on SO. :-) – T.J. Crowder Aug 06 '13 at 13:32
Nice find. Confusing that in the *same* collation they sort differently. – Aaron Bertrand Aug 06 '13 at 13:34
@AaronBertrand: Yeah, agreed. – T.J. Crowder Aug 06 '13 at 13:35
