4

https://stackoverflow.com/a/361059/14731 discusses the differences between "word sort" and "string sort".

How does one query programmatically when an SQL Collation will use "word sort" vs "string sort"?

Corollary: Do all collations use "word sort" for Unicode strings and "string sort" for non-Unicode strings?

SELECT * from sys.fn_HelpCollations()
WHERE name = 'SQL_Latin1_General_CP1_CI_AS'

provides a lot of details about the collation, but notice that it makes no mention of "word sort".

Community
  • 1
  • 1
Gili
  • 86,244
  • 97
  • 390
  • 689

2 Answers2

2

Let's start with the definition of these types of sorts as given by Microsoft (taken from the "Remarks" section of the CompareOptions Enumeration MSDN page):

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

Unicode is culturally-sensitive and weighted, and the XML and N-prefixed types are Unicode, so they could be saying that data in Unicode types use "word sort" while data in the non-Unicode types uses "string sort". Ordinal refers to the BIN and BIN2 collations, though the BIN collations aren't 100% ordinal due to how they treat the first character.

But let's see what SQL Server says it is doing. Run the following:

DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
INSERT INTO @SampleData (ANSI, UTF16) VALUES 
    ('a-b-c', N'a-b-c'),
    ('ac', N'ac'),
    ('aba', N'aba'),
    ('a-b', N'a-b'),
    ('ab', N'ab');

SELECT sd.ANSI AS [ANSI-Latin1_General_100_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE Latin1_General_100_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-Latin1_General_100_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE Latin1_General_100_CI_AS ASC;

SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
FROM   @SampleData sd
ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

Results:

ANSI-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

UTF16-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

ANSI-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
a-b
a-b-c
ab
aba
ac

UTF16-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac

Hmm. Only the SQL_ collation in combination with the VARCHAR field appears to be doing what could be considered "string sort". It makes sense that the SQL_ collation in combination with the NVARCHAR field would do "word sort" it is the same Unicode handling as the non-SQL_ collations. But is there something besides being a SQL Server collation (i.e. starting with SQL_ ) that determines "string" vs "word" sort? Let's look at the only properties of the collations that we can extract:

SELECT N'Latin1_General_100_CI_AS' AS [CollationName],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage') AS [CodePage],
       COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'LCID') AS [LCID],
      COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'ComparisonStyle') AS [ComparisonStyle]
UNION ALL
SELECT N'SQL_Latin1_General_CP1_CI_AS' AS [CollationName],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage'),
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID'),
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'ComparisonStyle');

Results:

CollationName                  CodePage   LCID    ComparisonStyle
----------------------------   --------   ----    ---------------
Latin1_General_100_CI_AS       1252       1033    196609
SQL_Latin1_General_CP1_CI_AS   1252       1033    196609

So, there are no discernible differences there. Which seems to leave us with this:

String sort is done when:

  • the collation name starts with SQL_, AND
  • the data (field, variable, string literal) is non-Unicode (i.e. CHAR / VARCHAR / TEXT)

For more information on Unicode sorting in general, check out the following resources:

Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
  • Thanks for the detailed answer, but it unfortunately is not enough. While http://www.unicode.org/reports/tr10/ mentions "ignorable characters" I can't figure out where to look up the list of ignorable characters per collation. For example when `SQL_Latin1_General_CP1_CI_AS` is in unicode mode, hyphens are supposed to be ignorable. Isn't there a lookup table for this kind of information, per collation? – Gili Aug 26 '15 at 18:47
  • @Gili I just answered your related question and it might have additional details that provide insight to your follow-up question. For any Unicode (this has nothing to do with `SQL_Latin1_General_CP1_CI_AS`, it has to do with using NVARCHAR / NCHAR / XML) sorting, the collation rules are the UCA algorithm with any locale-specific rules applied. I don't know of any means to "discover" how a character will be treated since the algorithms take into account context (i.e. surrounding characters). – Solomon Rutzky Aug 26 '15 at 19:37
  • @Gili Also, go to that Collation Charts link in my answer. Then click on the "Help" link towards the top in the left nav. That will describe the multiple levels of weights. Then in that left nav, you will see a link for "Punctuation" (4th one down). Does this get you closer? – Solomon Rutzky Aug 26 '15 at 20:56
  • Not really. I'll try to rephrase... The UCA algorithm discusses different comparison levels. Level 3 deals with case-sensitivity. Level 4 deals with ignorable punctuation. Where does it say that `SQL_Latin1_General_CP1_CI_AS` is mapped to comparison level 3 (ignoring punction)? Are you saying I should assume that *all* unicode comparisons use level 3? – Gili Aug 27 '15 at 20:59
  • @Gili The most important factor is the datatype. The `XML` and `N`-prefixed datatypes are Unicode-only. All Unicode sorting uses the "root" UCA rules except for any potential locale-specific overrides. Initially they are all "everything"-sensitive and use all levels, but the number of levels can vary by locale. The `AI` and `CI` options allow for completely ignoring the Accent and/or Case levels, respectively. .NET has an option for "IgnoreSymbols" that's not available in SQL Server. I think SQL Server collations are always Punctuation (i.e. Symbol)-sensitive. I will update this tomorrow. – Solomon Rutzky Aug 28 '15 at 06:48
  • @Gili I was able to do more research and testing. I mostly rewrote my answer in light of the new info. Please check it out. – Solomon Rutzky Aug 28 '15 at 21:17
  • 1
    Excellent detective work! I've accepted your answer. – Gili Sep 21 '15 at 19:17
  • Upon further investigation, I've discovered that this answer is missing some critical information. Your answer contains important pieces of the puzzle but I will post my own answer in an attempt to complement the missing pieces. – Gili Sep 21 '15 at 23:44
  • @Gili Slow down ;-) You need to give me a moment to respond. There is nothing missing from my answer here, at least not that I can see based on the two follow-up questions. If there is a question that you have, please ask and I can either update here or indicate that a follow-up question should be posted. As it stands, the http://stackoverflow.com/questions/32705178/why-doesnt-icu4j-match-utf-8-sort-order question was implicitly addressed in my answer here since I already linked to the same info that MarkRansom did. I will answer the other two in about 30 minutes after the kids are asleep. – Solomon Rutzky Sep 22 '15 at 00:09
  • No worries. Take a look at http://stackoverflow.com/a/32706510/14731 when you get a chance. I believe it addresses some points that I didn't get/understand from your answer. – Gili Sep 22 '15 at 00:28
1
  • srutzky's excellent answer reveals that, with the exception of non-Unicode types processed by SQL_ collators, all other data is sorted according to "Unicode Collation" rules.
  • Confusingly, Microsoft does not use the Unicode standard's sorting rules.
  • According to https://support.microsoft.com/en-us/kb/322112

    SQL Server 2000 supports two types of collations:

    • SQL collations
    • Windows collations

    [...]

    For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data.

    [...]

    A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules.

  • I interpret this as meaning that:

    • SQL_ collators are "SQL collations"
    • All other collators are "Windows collators".
    • With the exception of non-Unicode types processed by SQL_ collators, all other data is sorted according to "Windows collations".

So, let's dig into "Windows collations".

//  Sorting Flags.
//
//    WORD Sort:    culturally correct sort
//                  hyphen and apostrophe are special cased
//                  example: “coop” and “co-op” will sort together in a list
//
//                        co_op     <——-  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        co-op     <——-  hyphen (punctuation)
//                        cork
//                        went
//                        were
//                        we’re     <——-  apostrophe (punctuation)
//
//
//    STRING Sort:  hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <——-  hyphen (punctuation)
//                        co_op     <——-  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we’re     <——-  apostrophe (punctuation)
//                        went
//                        were
  • And finally, according to https://msdn.microsoft.com/en-us/library/windows/desktop/dd318144(v=vs.85).aspx

    [...] all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently from the other nonalphanumeric characters to ensure that words such as "coop" and "co-op" stay together in a sorted list.

Community
  • 1
  • 1
Gili
  • 86,244
  • 97
  • 390
  • 689
  • Hey there. A few notes. 1) "Microsoft does not use the Unicode standard's sorting rules." That is not entirely true, or at least it is too broad of a statement. Microsoft implemented both the Windows and the SQL collations. The SQL collations were only done because SQL Server, at the time, couldn't use the Windows collations. But for Unicode data (i.e. whatever is stored in the XML or N-prefixed types), they certainly do use the Unicode sorting rules. Keep in mind that the rules found at Unicode.org change more often than is reflected in software packages that implement their recommendations. – Solomon Rutzky Sep 22 '15 at 01:22
  • (cont.) **2)** "SQL Server 2000 supports two types of collations:" SQL Server 2000 is now 6 versions old. You should verify with current documentation. However, and fortunately, that information is still true. **3)** Your three point interpretation is correct, though just restating what is in my answer and what you stated at the top of this answer (except you swapped the word "Unicode" for "Windows"). – Solomon Rutzky Sep 22 '15 at 01:28
  • (cont.) **4)** "For Unicode data types, data comparisons are based on the Unicode code points." This quote is taken entirely out of context, and is hence incorrect on its own. The topic being discussed when that statement is made is specifically "Binary collations". They are saying that Binary collations sort Unicode data based on the Unicode code point, which might not be the same order for the same characters if looking at a locale's non-Unicode code page. But, the first two sentences of that paragraph explain how non-Binary collations handle sorting, which is in agreement with my answer. – Solomon Rutzky Sep 22 '15 at 01:35
  • (cont.) **5)** The final two items -- quoted sections of winnls.h and "Handling Sorting in Your Applications" -- seem to be in complete agreement with what I quoted and gave example code for. The only difference appears to be that they specify clearly "hyphen and apostrophe" as the special cases. If that is the only item missing from my answer, and if it is an important item, then I can easily add it, though it was not part of the original question. But it would be fine as a note on the bottom if it helps to clarify. – Solomon Rutzky Sep 22 '15 at 01:45
  • @srutzky How do you explain the fact that `_` is sorted before `#`? This doesn't seems to match the Unicode sorting order as mentioned at http://stackoverflow.com/a/32706275/14731. In my own experiments, it seems that sorting order corresponds to Unicode code point comparison with special handling for punctuation. Meaning, it doesn't look like they use the normal Unicode collation algorithm. – Gili Sep 22 '15 at 02:25
  • I am actually finishing up my research now on those 2 related questions. I just needed to find something concrete regarding the variation of the apparent rule. I will post on the "Why does SQL_Latin1_General_CP1_CI_AS sort" question in a moment. – Solomon Rutzky Sep 22 '15 at 02:46
  • @srutzky I have marked this answer as accepted because I have implemented "word sort" according to the rules mentioned above and I can confirm that the resulting ordering matches that of SQL Server perfectly. – Gili Sep 25 '15 at 15:49
  • What collations have you tested against on SQL Server? and what other linguistic "issues" have you tested with? Just curious. I am building up test cases for your other question. – Solomon Rutzky Sep 25 '15 at 15:58
  • @srutzky I only tested against unicode strings with the `SQL_Latin1_General_CP1_CI_AS` collation. My testcase verifies the sorting order for: 1. `"co_op", "coat", "comb", "coop", "co-op", "cork", "went", "were", "we're"` 2. `"a-bc", "abd"` 3. `"_sAsimov", "_S-Labsql"` 4. `"##MS_PolicyTsqlExecutionLogin##", "_gaia", "KARMALAB\\_sAsimov", "KARMALAB\\_S-Labsql", "KARMALAB\\_s-PCA", "KARMALAB\\_s-PCS", "KARMALAB\\_SuccService"` and 5. `"DBModule-common", "DBModule-commonUpdateTime"`. In each case, the sorting order is expected to remain unchanged (meaning, the list is in the expected order). – Gili Sep 25 '15 at 16:52