30

I know this issue has been noted before, more or less concisely, but I still create this new thread because I ran into the issue again when writing a unit test.

The default string comparison (that is the culture-dependent case-sensitive comparison that we get with string.CompareTo(string), Comparer<string>.Default, StringComparer.CurrentCulture, string.Compare(string, string) and others) violates transitivity when the strings contain hyphens (or minus signs, I am talking about plain U+002D characters).

Here is a simple repro:

static void Main()
{
  const string a = "fk-";
  const string b = "-fk";
  const string c = "Fk";

  Console.WriteLine(a.CompareTo(b));  // "-1"
  Console.WriteLine(b.CompareTo(c));  // "-1"
  Console.WriteLine(a.CompareTo(c));  // "1"

  var listX = new List<string> { a, b, c, };
  var listY = new List<string> { c, a, b, };
  var listZ = new List<string> { b, c, a, };
  listX.Sort();
  listY.Sort();
  listZ.Sort();
  Console.WriteLine(listX.SequenceEqual(listY));  // "False"
  Console.WriteLine(listY.SequenceEqual(listZ));  // "False"
  Console.WriteLine(listX.SequenceEqual(listZ));  // "False"
}

In the upper part we see how transitivity fails. a is less than b, and b is less than c, yet a fails to be less than c.

This goes against the documented behavior of Unicode collation which states that:

... for any strings A, B, and C, if A < B and B < C, then A < C.

Now sorting a list with a, b and c is exactly like trying to rank the hands of "Rock", "Paper" and "Scissors" in the well-known intransitive game. An impossible task.

The last part of my code sample above shows that the result of sorting depends on the initial order of the elements (and there are no two elements in the list which compare "equal" (0)).

Linq's listX.OrderBy(x => x) is also affected, of course. This should be a stable sort, but you get strange results when ordering a collection containing a, b and c together with other strings.

I tried this with all the CultureInfos on my machine (since this is a culture-dependent sort), including the "invariant culture", and each and every one has the same problem. I tried this with the .NET 4.5.1 runtime, but I believe older versions have the same bug.

Conclusion: When sorting strings in .NET with the default comparer, results are unpredictable if some strings contain hyphens.

What changes were introduced in .NET 4.0 that caused this behavior?

It has already been observed that this behavior is inconsistent across different versions of the platform: in .NET 3.5, strings with hyphens can be reliably sorted. In all versions of the framework, calling System.Globalization.CultureInfo.CurrentCulture.CompareInfo.GetSortKey provides unique DeyData for these strings, so why aren't they sorted correctly?

Community
  • 1
  • 1
Jeppe Stig Nielsen
  • 60,409
  • 11
  • 110
  • 181
  • 17
    This question appears to be off-topic because it is more of a rant than an answerable question. – Jaime Torres Apr 15 '14 at 15:28
  • 7
    +1, I've come across same issue before in SO, I honestly don't know why but to solve that you can change the sort rules to `StringComparison.Ordinal`. – Sriram Sakthivel Apr 15 '14 at 15:29
  • 2
    and here's the link to the [aforementioned question](http://stackoverflow.com/questions/19370734/sortedlist-sorteddictionary-weird-behavior/19371082#19371082). Interestingly you have a comment in my answer as well :-D – Sriram Sakthivel Apr 15 '14 at 15:45
  • See http://msdn.microsoft.com/en-us/library/dd465121.aspx – mellamokb Apr 15 '14 at 15:45
  • 4
    Voting to close. If there is a string comparer with the desired behavior then what is the problem? – paparazzo Apr 15 '14 at 16:26
  • 14
    `Question: When will .NET fix their broken default comparer for strings?` --> can't be answered. `How can keeping this behavior be better than creating a consistent behavior?` --> opinion based. This is not an appropriate question for SO. – Jon B Apr 15 '14 at 17:09
  • 2
    @Blam There is no string comparer with the "desired behavior". I know the `Ordinal` comparer is different, but it sorts characters according to their codepoint values in Unicode, not according to usual rules in English (or other languages). I clearly describe a bug in the *default* comparer. The fact that other comparers (with completely different properties) exist does not "justfy" a bug in the default comparer. Is the default comparer officially retracted or deprecated? No. – Jeppe Stig Nielsen Apr 15 '14 at 17:13
  • 4
    I agree with the suggestion close. The explicit question is dependent on MS internal priorities. – Shawn C Apr 15 '14 at 17:39
  • 12
    @JeppeStigNielsen So you have described a bug. SO cannot answer the question as to when it will be fixed. That is a question for Microsoft. – paparazzo Apr 15 '14 at 18:11
  • 5
    I voted to reopen this question, I am not sure if it is a bug or not, but it is an interesting question with a Minimal, Complete, and Verifiable example. There are people on SO who could definitely answer that. Even if it is a bug, there have been [instances](http://stackoverflow.com/questions/6471527/c-sharp-compiler-bug-object-initializer-syntax-used-for-write-only-property-in) where it was clarified on SO and then reported to Microsoft. – Habib Apr 15 '14 at 18:40
  • @HansPassant Do you realize that substituting one or two `\u002D` in the above program with `\u2212` or/and `\u2010` still gives the bug? I wanted no surprise, this time I used unambiguous Unicode characters, but got surprised? – Jeppe Stig Nielsen Apr 15 '14 at 20:06
  • 1
    `The CompareTo method was designed primarily for use in sorting or alphabetizing operations. It should not be used when the primary purpose of the method call is to determine whether two strings are equivalent. To determine whether two strings are equivalent, call the Equals method.` This I found in http://msdn.microsoft.com/en-us/library/35f0x18w.aspx . have you check it? – Bharadwaj Apr 16 '14 at 14:06
  • 1
    @Bharadwaj Irrelevant. For the strings in question, `CompareTo` considers them different, just as `Equals`. I was not mis-using `CompareTo` to test for equality. There is no problem with strings (such as `"Strasse"` and `"Straße"`) that the culture-dependent comparison considers equal, if that was what you had in mind! – Jeppe Stig Nielsen Apr 16 '14 at 14:32
  • 1
    An interesting observation: in .NET 3.5, the outputs are `1,-1,1` and `True, True, True`. It appears between 3.5 and 4.0 there were already "breaking" changes to the framework. – nicholas Apr 16 '14 at 17:49
  • @nicholas Very interesting indeed. It sounds like the string comparer in 3.5 had a "good" transitive ordering of these three strings, namely `a ⊳ c ⊳ b`. I didn't know they changed anything in the comparer. Maybe they were trying to fix other (known) issues with comparing strings with hyphens, and then introduced this problem in the process? – Jeppe Stig Nielsen Apr 17 '14 at 08:32
  • @Habib It sure looks to me like quite a glaring bug, and an easily reproducible one at that. And the only way it has any chance of getting it fixed is to report it straight to Microsoft, not chat about it on SO. – JLRishe Apr 25 '14 at 04:22
  • It's not a bug. And the question is answerable also. – Jani Hyytiäinen Apr 25 '14 at 15:56
  • @JaniHyytiäinen: Can you describe any situation in which the behavior as it is exists would be preferable to any transitive implementation which would only report `x>y` if, for the present implementation, there would be some sequence of x1, x2, y1, y2, etc. `x >= x1 >= x2 ... xn > y1 >= y2 ... yn > y`? – supercat Nov 07 '14 at 15:31

1 Answers1

3

Microsoft Connect Discussion Here is some code to workaround:

static int CompareStringUsingSortKey(string s1, string s2)
{
    SortKey sk1 = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(s1);
    SortKey sk2 = CultureInfo.InvariantCulture.CompareInfo.GetSortKey(s2);
    return SortKey.Compare(sk1, sk2);
}
Kevin Cook
  • 1,922
  • 1
  • 15
  • 16
  • Quote from Microsoft on 2014 May 1: *"We are tracking to fix this issue on Windows 8 in the near future. also this issue should be fixed in the future Windows release too."* – BlueRaja - Danny Pflughoeft May 16 '14 at 18:14
  • 1
    @BlueRaja-DannyPflughoeft That was actually **2013** May 1. – Jeppe Stig Nielsen May 16 '14 at 19:43
  • 1
    @BlueRaja-DannyPflughoeft I checked (tested) further, and the bug as described in the Connect entry linked by the answer, has actually been solved with newest version of Windows (8.1) and .NET. However, the problem from the question with `a`, `b` and `c` still persists. However, the workaround with using `SortKey` also works for that problem. So maybe Microsoft's fix is only partial. – Jeppe Stig Nielsen May 19 '14 at 14:02
  • 1
    That comparison method is going to be extremely slow. It might be made somewhat better by using a `ConditionalWeakTable` to store the sort key associated with each string object; if multiple string objects contain the same characters, it might be worthwhile to also use some sort of weak hash map; I'm not sure how the cost of looking up a string in a weak hash map would compare with the cost of making a new sort key for it, however. – supercat Nov 07 '14 at 14:29
  • I just did a quick test; it seems that generating the CompareInfo with each comparison is a little more than 50x slower than using OrdinalIgnoreCase. Using a `ConditionalWeakTable` for lookup reduces that to being about 5x slower. If one stores CompareInfo objects rather than strings, performance is comparable to using OrdinalIgnoreCase. – supercat Nov 07 '14 at 15:26