17

If I have a list of strings that contain only digits and dashes, they sort in ascending order like so:

s = s.OrderBy(t => t).ToList();

66-0616280-000
66-0616280-100
66-06162801000
66-06162801040

This is as expected.

However, if the strings contain letters, the sort is somewhat unexpected. For example, here is the same list of strings with trailing A's replacing the 0s, and yes, it is sorted:

66-0616280-00A
66-0616280100A
66-0616280104A
66-0616280-10A

I would have expected them to sort like so:

66-0616280-00A
66-0616280-10A
66-0616280100A
66-0616280104A

Why does the sort behave differently on the string when it contains letters vs. when it contains only numbers?

Thanks in advance.

BBauer42

2 Answers

13

It's because the default string comparer is culture-sensitive. As far as I can tell, Comparer<string>.Default delegates to string.CompareTo(string), which uses the current culture:

This method performs a word (case-sensitive and culture-sensitive) comparison using the current culture. For more information about word, string, and ordinal sorts, see System.Globalization.CompareOptions.

Then the page for CompareOptions includes:

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

("Small weight" isn't quite the same as "ignored" as quoted in Andrei's answer, but the effects are similar here.)

If you specify StringComparer.Ordinal, you get results of:

66-0616280-00A
66-0616280-10A
66-0616280100A
66-0616280104A

Specify it as the second argument to OrderBy:

s = s.OrderBy(t => t, StringComparer.Ordinal).ToList();

You can see the difference here:

Console.WriteLine(Comparer<string>.Default.Compare
    ("66-0616280104A", "66-0616280-10A"));
Console.WriteLine(StringComparer.Ordinal.Compare
    ("66-0616280104A", "66-0616280-10A"));
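
For a fuller picture, here is a minimal, self-contained sketch that sorts the question's list under the three kinds of sort described in the CompareOptions remarks. The class name `SortDemo` and the helper `SortWith` are made up for illustration, and the string-sort comparer is hand-built with `Comparer<string>.Create` (available from .NET 4.5). Only the ordinal result is independent of platform and culture; the two culture-sensitive results depend on the runtime and the current culture, so no output is claimed for them here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class SortDemo
{
    // Hypothetical helper: returns a sorted copy of the input using the given comparer.
    public static List<string> SortWith(IEnumerable<string> items, IComparer<string> comparer)
    {
        return items.OrderBy(t => t, comparer).ToList();
    }

    public static void Main()
    {
        var s = new List<string>
        {
            "66-0616280104A",
            "66-0616280-10A",
            "66-0616280100A",
            "66-0616280-00A",
        };

        // Word sort: what OrderBy(t => t) uses by default (current culture).
        IComparer<string> wordSort = Comparer<string>.Default;

        // String sort: still culture-sensitive, but requested via CompareOptions.StringSort.
        IComparer<string> stringSort = Comparer<string>.Create(
            (a, b) => CultureInfo.CurrentCulture.CompareInfo.Compare(a, b, CompareOptions.StringSort));

        // Ordinal sort: compares raw character values; '-' (0x2D) sorts before '0' (0x30).
        IComparer<string> ordinalSort = StringComparer.Ordinal;

        Console.WriteLine("Word:    " + string.Join(", ", SortWith(s, wordSort)));
        Console.WriteLine("String:  " + string.Join(", ", SortWith(s, stringSort)));
        Console.WriteLine("Ordinal: " + string.Join(", ", SortWith(s, ordinalSort)));
    }
}
```

The ordinal line should always list the two hyphenated strings first, because the hyphen's code point is lower than any digit's.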
Jon Skeet
  • I think this answer isn't really complete without the explanation of how the ordering differs. Saying how to fix it is good, but it's nice to know why the original failed, which is what the question actually was (though that has been explained by Andrei). – Chris Feb 19 '14 at 16:46
  • I agree, the combination of Andrei and Jon's answer is really the answer, wish I could select both. – BBauer42 Feb 19 '14 at 16:49
  • @BBauer42, I think you should select Jon's, so when others run into the same problem they will see how it can be solved. Not everyone cares for the explanation, but everyone cares for the solution. – Andrei Feb 19 '14 at 16:55
  • @Andrei: Hopefully my edit contains a sufficient explanation as well now :) – Jon Skeet Feb 19 '14 at 16:56
  • Thanks for the updated version, Jon. A much more complete answer. – Chris Feb 21 '14 at 12:16
5

Here is the remark from MSDN:

Character sets include ignorable characters. The Compare(String, String) method does not consider such characters when it performs a culture-sensitive comparison. For example, if the following code is run on the .NET Framework 4 or later, a culture-sensitive comparison of "animal" with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two strings are equivalent.

So it looks like you are running into this ignorable-character case. If we assume that the - symbol carries a very small weight in the comparison, the strings sort as though the hyphens were not there:

First case:

660616280000
660616280100
6606162801000
6606162801040

Second case:

66061628000A
660616280100A
660616280104A
66061628010A

Which makes sense: with the hyphens removed, each list is in ascending order.
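
The "hyphens removed" reading can be checked with a small sketch. The class `HyphenModel` and the method `SortIgnoringHyphens` are hypothetical names; the code only models the observed order by sorting on the hyphen-free text with an ordinal comparer, and is not how the framework implements culture-sensitive comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HyphenModel
{
    // Hypothetical helper: approximates "the hyphen has negligible weight" by
    // sorting on the text with hyphens stripped, using an ordinal comparer.
    public static List<string> SortIgnoringHyphens(IEnumerable<string> items)
    {
        return items
            .OrderBy(t => t.Replace("-", ""), StringComparer.Ordinal)
            .ToList();
    }

    public static void Main()
    {
        var s = new List<string>
        {
            "66-0616280-10A",
            "66-0616280104A",
            "66-0616280-00A",
            "66-0616280100A",
        };

        // Prints:
        // 66-0616280-00A
        // 66-0616280100A
        // 66-0616280104A
        // 66-0616280-10A
        foreach (var item in SortIgnoringHyphens(s))
        {
            Console.WriteLine(item);
        }
    }
}
```

That is the same order the question reports for the default sort, which is consistent with the hyphen contributing little or nothing to the comparison for this data.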

Andrei
  • Very nice explanation of how culture-sensitivity affects this scenario. – Jon Senchyna Feb 19 '14 at 16:50
  • I'm not sure it's actually an *ignorable* character here, as I believe it's just a normal ASCII hyphen. I suspect it's the "small weight" aspect which comes into play - see my edited answer for details. – Jon Skeet Feb 19 '14 at 16:55
  • @JonSkeet, yeah, I see. It's still not clear, though, when the hyphen does affect culture-sensitive sorting. A small weight is still a weight, and it should come into play somewhere. Are you aware of such examples? – Andrei Feb 19 '14 at 17:01
  • @Andrei: I suspect it means that "a-b" will come before "a--b", whereas if the characters were completely *ignored*, they would be sorted equally. Just a guess though :) – Jon Skeet Feb 19 '14 at 17:04