I thought that in .NET strings were compared alphabetically and that they were compared from left to right.

string[] strings = { "-1", "1", "1Foo", "-1Foo" };
Array.Sort(strings);
Console.WriteLine(string.Join(",", strings));

I'd expect this (or both strings beginning with a minus sign first):

1,1Foo,-1,-1Foo

But the result is:

1,-1,1Foo,-1Foo

It seems to be a mixture, either the minus sign is ignored or multiple characters are compared even if the first character was already different.

Edit: I've now tested OrdinalIgnoreCase and I get the expected order:

Array.Sort(strings, StringComparer.OrdinalIgnoreCase);

But even if I use InvariantCultureIgnoreCase, I get the unexpected order.

Tim Schmelter
  • Where in the alphabet does a "-" fall? It probably has to do with hyphenated words not having their order changed in the dictionary. – Dan Drews Jun 25 '14 at 12:20
  • @DanDrews: Does it matter? If it comes first then both should come first. – Tim Schmelter Jun 25 '14 at 12:22
  • Try this `string.Compare("1", "-1Foo", StringComparison.InvariantCultureIgnoreCase)` vs. this `string.Compare("-1", "1Foo", StringComparison.InvariantCultureIgnoreCase)`. You'll get -1 both times. – Lasse V. Karlsen Jun 25 '14 at 12:23
  • He is probably using `de-DE` as the current culture. – Soner Gönül Jun 25 '14 at 12:24
  • This was ranted about [here](http://stackoverflow.com/questions/23087995/string-comparison-and-sorting-when-strings-contain-hyphens) recently. `-` just doesn't *have* a position - it's "ignorable" as far as all of the comparison features work, and it's a problem at the OS level more than the .NET level. – Damien_The_Unbeliever Jun 25 '14 at 12:24
  • This is a bug / issue in the .NET framework - the output sort order depends on the order of the original set. Starting in .NET 4.0, string comparison does not maintain transitive consistency. Related: http://stackoverflow.com/questions/23087995/string-comparison-and-sorting-when-strings-contain-hyphens – nicholas Jun 25 '14 at 12:25
  • @TimSchmelter It would matter IF the "-" is ignored (I was trying to give a possible justification for it being ignored) – Dan Drews Jun 25 '14 at 12:25
  • This is not culture-specific, this is the same for all cultures in .NET. – Lasse V. Karlsen Jun 25 '14 at 12:27
  • I think it is explained here: http://stackoverflow.com/questions/21886555/unexpected-behavior-when-sorting-strings-with-letters-and-dashes – Vladimir Jun 25 '14 at 12:27
  • You should use `StringComparer.Ordinal`. – Vladimir Jun 25 '14 at 12:28
  • Please note that comparing `animal` with `ani-mal` will in fact return a sort order that switches if you switch which string has the hyphen, whereas the one I showed above with "1" vs. "-1Foo" does not, but perhaps a hyphen at the start is special. Text handling is complex. – Lasse V. Karlsen Jun 25 '14 at 12:28
  • @usr: the result is the same if I use `Array.Sort(strings, StringComparer.InvariantCultureIgnoreCase);`. – Tim Schmelter Jun 25 '14 at 12:29
  • I tested all 356 cultures in .NET and they all produce -1 for both of the comparisons I posted above, so this is not culture-*specific*. Rather, the rules are governed by the text system in .NET (or the operating system), of that I'm pretty sure. – Lasse V. Karlsen Jun 25 '14 at 12:30
  • @TimSchmelter If you use `StringComparer.Ordinal`, you end up with "-1,-1Foo,1,1Foo" since that deals with the raw character values. – Daniel Kotin Jun 25 '14 at 12:32
  • @TimSchmelter: Java does it correctly. I am wondering what order the sorting is based on. The ASCII table says that "-" has a value of 45, so the output must be -1,-1Foo,1,1Foo. Java produced the expected result. – Rangesh Jun 25 '14 at 12:32
  • Jon Skeet explains it here: http://stackoverflow.com/a/21886726/61697 - seems that if the string contains only digits then the hyphen is given a large weighting and the sort order is *as expected*. With strings containing non-digits the hyphen is given a smaller importance when sorting. – demoncodemonkey Jun 25 '14 at 12:32
  • I would be wary of throwing around "correctly" here; we don't know what the specification says for the relevant piece of code and why it says so. There may be textual rules that say the .NET/Windows handling is correct, and it's just our expectations that are wrong. – Lasse V. Karlsen Jun 25 '14 at 12:33
  • Try your luck with different combinations here: https://dotnetfiddle.net/FAupKf – Sandip Jun 25 '14 at 12:36

2 Answers

There is a small note in the String.CompareTo method documentation:

Notes to Callers:

Character sets include ignorable characters. The CompareTo(String) method does not consider such characters when it performs a culture-sensitive comparison. For example, if the following code is run on the .NET Framework 4 or later, a comparison of "animal" with "ani-mal" (using a soft hyphen, or U+00AD) indicates that the two strings are equivalent.

And a little later it states:

To recognize ignorable characters in a string comparison, call the CompareOrdinal(String, String) method.

These two statements seem to be consistent with the results you are seeing.
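
The difference the two quoted notes describe can be checked directly. The following is a minimal sketch, not taken from the documentation: the names `IgnorableDemo`, `CompareLinguistic`, `CompareByCodeUnit`, and `Run` are made up for illustration, and it uses the soft hyphen (U+00AD) from the quoted note rather than the ordinary hyphen-minus from the question, which may be weighted differently. The linguistic result can depend on the runtime version and platform, so only the ordinal result is annotated.

```csharp
using System;

static class IgnorableDemo
{
    // "ani\u00ADmal" contains a soft hyphen (U+00AD), as in the quoted note.
    public const string Plain = "animal";
    public const string WithSoftHyphen = "ani\u00ADmal";

    // Linguistic comparison: ignorable characters are not considered.
    public static int CompareLinguistic() =>
        string.Compare(Plain, WithSoftHyphen, StringComparison.InvariantCulture);

    // Ordinal comparison: every UTF-16 code unit counts, including U+00AD.
    public static int CompareByCodeUnit() =>
        string.CompareOrdinal(Plain, WithSoftHyphen);

    // Hypothetical entry point for the demo; call it from your own Main.
    public static void Run()
    {
        Console.WriteLine(CompareLinguistic());
        Console.WriteLine(CompareByCodeUnit() < 0); // True: 'm' (U+006D) is below U+00AD
    }
}
```

If the first line prints 0, the soft hyphen was ignored by the linguistic comparison, while the ordinal comparison still sees it.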

John Koerner
  • But the hyphen is not completely ignored. Try comparing `"animal"` with `"ani-mal"` and do the comparison both ways: you'll get a value that switches when you switch the values, whereas `"1"` vs. `"-1Foo"` and then switching the hyphen around does not. This alone is not the answer. – Lasse V. Karlsen Jun 25 '14 at 12:35
  • Also, even if I use [`InvariantCulture`](http://msdn.microsoft.com/en-us/library/system.globalization.cultureinfo.invariantculture.aspx) I get the same result, but you've quoted: "when it performs a culture-sensitive comparison". By the way, I'm really using .NET 4, if that matters. Thanks – Tim Schmelter Jun 25 '14 at 12:37

Jon Skeet to the rescue here

Specifically:

The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

But passing StringComparer.Ordinal makes it behave as you want:

string[] strings = { "-1", "1", "10", "-10", "a", "ba", "-a" };
Array.Sort(strings, StringComparer.Ordinal);
Console.WriteLine(string.Join(",", strings));
// prints: -1,-10,-a,1,10,a,ba
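
The three sort kinds from the quote can also be put side by side. This is a rough sketch under a few assumptions: `SortKindsDemo`, `SortWith`, and `Run` are illustrative names, the invariant culture stands in for whatever culture you actually sort with, and `CompareOptions.None` is taken to mean the default word sort. The word-sort and string-sort lines are left unannotated because their output may differ between runtime versions and platforms.

```csharp
using System;
using System.Globalization;

static class SortKindsDemo
{
    // Returns a sorted copy of the input, comparing with the invariant
    // culture's CompareInfo and the given options.
    public static string[] SortWith(string[] input, CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        string[] copy = (string[])input.Clone();
        Array.Sort(copy, (a, b) => compareInfo.Compare(a, b, options));
        return copy;
    }

    // Hypothetical entry point for the demo; call it from your own Main.
    public static void Run()
    {
        string[] strings = { "-1", "1", "1Foo", "-1Foo" };
        Console.WriteLine("Word sort:    " + string.Join(",", SortWith(strings, CompareOptions.None)));
        Console.WriteLine("String sort:  " + string.Join(",", SortWith(strings, CompareOptions.StringSort)));
        // Ordinal compares UTF-16 code units, and '-' (45) is below '1' (49).
        Console.WriteLine("Ordinal sort: " + string.Join(",", SortWith(strings, CompareOptions.Ordinal)));
    }
}
```

Only the ordinal line is fully determined by the code unit values: it yields -1,-1Foo,1,1Foo.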

Edit:
About Ordinal, quoting from the MSDN documentation of the CompareOptions enumeration:

Ordinal: Indicates that the string comparison must use successive Unicode UTF-16 encoded values of the string (code unit by code unit comparison), leading to a fast comparison but one that is culture-insensitive. A string starting with a code unit XXXX₁₆ comes before a string starting with YYYY₁₆, if XXXX₁₆ is less than YYYY₁₆. This value cannot be combined with other CompareOptions values and must be used alone.

There is also String.CompareOrdinal if you want an ordinal comparison of two strings.

Here's another note of interest:

When possible, the application should use string comparison methods that accept a CompareOptions value to specify the kind of comparison expected. As a general rule, user-facing comparisons are best served by the use of linguistic options (using the current culture), while security comparisons should specify Ordinal or OrdinalIgnoreCase.
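
As a sketch of what that guidance could look like in code (the class and method names here are hypothetical, and the "identifier" scenario is only an assumed example of a security-style check):

```csharp
using System;
using System.Globalization;

static class ComparisonChoiceDemo
{
    // User-facing ordering: linguistic comparison using the current culture.
    public static int CompareForDisplay(string a, string b) =>
        string.Compare(a, b, CultureInfo.CurrentCulture, CompareOptions.IgnoreCase);

    // Security-style check: ordinal, so no culture rules and no ignorable characters.
    public static bool IsSameIdentifier(string a, string b) =>
        string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
}
```

With the ordinal check, "co-op" and "coop" can never be treated as the same identifier, whatever weight a culture gives the hyphen.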

I guess we humans expect ordinal when dealing with strings :)

Noctis
  • Thanks. But if I use `InvariantCulture` it should not compare by using any culture. If I've understood your quote correctly, that should mean "String sort", which has no special cases like `-`. Does that mean the order is arbitrary/unpredictable with `-` if I don't use `Ordinal`? – Tim Schmelter Jun 25 '14 at 12:44