I am seeing some very odd behavior in the way that .NET sorts strings when using `List<string>.Sort()`.

Here's what I mean. This example produces what I believe is the correct lexicographic ordering of these special characters:

public void sortOrder()
{
    var list = new List<string> {"_", "-"};
    list.Sort();
    Console.WriteLine("Output: " + string.Join(", ", list));  
}

I get `-, _` as a result, which I believe is correct. However, when I do the following:

public void sortOrder2()
{
    var list = new List<string> {"x-amz-meta-file-number", "x-amz-meta-file_type"};
    list.Sort();
    Console.WriteLine("Output: " + string.Join(", ", list));
}

I get `x-amz-meta-file_type, x-amz-meta-file-number`, which I did not expect based on my first test.

Does anyone know why .NET would sort these strings differently?

PatTheGamer
  • Short answer: collation is *weird*. There are a number of oddities like this. – Eric Lippert Sep 29 '16 at 17:49
  • Weird: in some Java code we have, we get the correct sorting... – PatTheGamer Sep 29 '16 at 17:51
  • For more examples of how collation doesn't behave as you might expect, see http://stackoverflow.com/questions/492799/difference-between-invariantculture-and-ordinal-string-comparison – Eric Lippert Sep 29 '16 at 17:52
  • Even stranger: looking at the .NET source, Sort() calls `Array.Sort(this._items, index, count, comparer);`, yet calling `Array.Sort(list.ToArray(), 0, list.ToArray().Count(), null);` directly produces the correct output. – Dispersia Sep 29 '16 at 17:53
  • The slightly longer answer is: with no argument to Sort, it tries to sort *as a person would*, and a person would sort both "a-b" and "a_b" before both "a-c" and "a_c". If that's not acceptable to you, pass `StringComparer.Ordinal` as the argument to `Sort`. – Eric Lippert Sep 29 '16 at 17:56
  • @Alexei I already removed that comment. It was short-sighted, but it wouldn't be the first time someone copied a string including invisible hyphens or "fancy" hyphens off some web page directly into their code. I copied (haha) and inspected the strings, and this wasn't the case. – CodeCaster Sep 29 '16 at 17:57
  • @CodeCaster: Indeed, a plausible guess. Often the invisible character is a byte order mark or some such. – Eric Lippert Sep 29 '16 at 17:58
  • A more intuitive demo of why the sorting results are sensible: try sorting { "Mark-b", "Mark_a", "Mark_c", "Mark-d" }. If those were names of people, the alphabetical ordering would be far more important than the symbol ordering. – Brian Sep 29 '16 at 20:55
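
The suggestion in the comments to pass `StringComparer.Ordinal` to `Sort` can be sketched roughly as below. This is a minimal illustration, not a definitive fix: the class name `OrdinalSortDemo` is made up for the example, and it assumes that ordering by raw UTF-16 code unit value (where `-` is U+002D and `_` is U+005F) is what you actually want, rather than the culture-sensitive ordering that the parameterless `Sort()` uses for strings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class; the name is an assumption for illustration only.
public static class OrdinalSortDemo
{
    public static void Main()
    {
        var list = new List<string> { "x-amz-meta-file_type", "x-amz-meta-file-number" };

        // Ordinal comparison looks only at the numeric value of each char,
        // so '-' (U+002D) sorts before '_' (U+005F) regardless of culture.
        list.Sort(StringComparer.Ordinal);

        // Prints: Output: x-amz-meta-file-number, x-amz-meta-file_type
        Console.WriteLine("Output: " + string.Join(", ", list));
    }
}
```

Ordinal comparison is usually a reasonable choice for identifiers such as header names, where a stable, culture-independent order matters more than a human-friendly one. For text shown to users, the culture-sensitive default may still be preferable.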
