
I sometimes shuffle lists or arrays using OrderBy(item => R.NextDouble()), where the Random instance R is initialized elsewhere.
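For example, a minimal sketch (items stands in for whatever list or array is being shuffled):

var shuffled = items.OrderBy(item => R.NextDouble()).ToList();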

Now this is clearly a hack, though it is the one recommended all over this site, and it works very nicely in practice.

However, it was also pointed out that this assumes the sorting algorithm does not get confused when the key of a single item changes between evaluations, e.g. by entering an infinite loop.

My question is whether there is some sort of implicit or explicit guarantee that this won't happen. I could not find anything about it on MSDN.

While there are very good, specialized O(n) algorithms for this purpose (e.g. the Fisher-Yates shuffle), I don't always want to look them up and copy-paste them into little side projects (we are not talking about production code here). They would obviously be the right solution.

Note that performance is not an issue at all. Also, there is no need for good or secure randomness (a cryptographic library would have to be used in that case); this is just about shuffling stuff up a bit.

mafu
  • This is not a dupe of http://stackoverflow.com/questions/1287567, as this question is instead about the continued functioning of the method (i.e. that it does not loop or crash). – mafu Sep 28 '15 at 10:15

3 Answers


This is not guaranteed to work according to the documentation.

I have peeked into the internals. The sorting algorithm first captures all sort keys into a temporary array and never invokes the selector again. So in the current implementation this is safe, but a new .NET version or even a patch could invalidate this technique.

You might argue that for compatibility reasons this will never change, and the probability of that being true is something like 95% (source: me). Microsoft has very high compatibility standards.

Another problem: if you generate the same value for two keys, they will not be positioned randomly relative to each other. Instead, their order is determined by the internals of the sorting algorithm, which might be deterministic. For example, they might never switch their relative order.
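A quick illustration of that determinism (a sketch; as the comments below point out, Enumerable.OrderBy is documented as a stable sort, so equal keys keep their source order):

// Both items receive the same key, so the stable sort leaves them
// in their original relative order: "a" always precedes "b".
var result = new[] { "a", "b" }.OrderBy(x => 0).ToArray();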

I would only rely on this in short-term throw-away projects.

Another hack in this spirit is to sort by Guid.NewGuid(). That saves you another line of code and has no seeding problems. On the other hand, GUIDs might have a pattern or even be sequential, which is also very problematic.
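For example (the same caveats apply, with items again hypothetical):

var shuffled = items.OrderBy(_ => Guid.NewGuid()).ToList();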

I would fail any of this in a code review of production code.

usr
  • 168,620
  • 35
  • 240
  • 369
  • Thank you for pointing out the temporary array. I would guess that vendors will not remove it in the future, as many people are abusing random (yay to us!), but that is just speculation. – mafu Sep 28 '15 at 10:32
  • `OrderBy` is documented to use a stable sort, therefore the order is always deterministic for equal sorting keys. – Lucas Trzesniewski Sep 28 '15 at 10:33
  • @LucasTrzesniewski right, that's terrible. I also don't understand why they would make it stable. This property is never useful in practice, especially since we have ThenBy. API mistake. – usr Sep 28 '15 at 10:34
  • @LucasTrzesniewski close, `Enumerable.OrderBy` is documented as being stable, other `OrderBy` implementations are not. P-Linq's `OrderBy` is certainly not stable. – Jon Hanna Sep 28 '15 at 10:35
  • It's not that terrible, a stable quicksort is not much worse in performance than an unstable quicksort, and certainly cheaper than using `ThenBy` to force stability when needed. – Jon Hanna Sep 28 '15 at 10:36
  • @JonHanna you're right, I assumed we were talking about `Enumerable` here but it's important to point it out. – Lucas Trzesniewski Sep 28 '15 at 10:36
  • @usr no that's not terrible. I've used the stable property of the sorting algorithm a lot of times, it's *very* useful. – Lucas Trzesniewski Sep 28 '15 at 10:37
  • @JonHanna I have never in my life needed stability. Sorting in multiple passes is a code smell. Use ThenBy. The sort can be stable, sure, but don't document it. Maybe we later want to switch in a new algorithm such as TimSort. That's now barred. – usr Sep 28 '15 at 10:37
  • If it wasn't documented, then people who did need it would have to use the expense of `ThenBy`. – Jon Hanna Sep 28 '15 at 10:38
  • @JonHanna true, those 0,01% of the cases (I mean that literally) need to use something else. Copy an algo from the web maybe. If perf is important you will not use Enumerable.* anyway. The 99% care about perf more than about stability. – usr Sep 28 '15 at 10:39
  • I can't see where stability hurts here (and I've looked at possible improvements to the method quite a lot, including testing unstable algorithms), unless you are going parallel (in which case you aren't using Enumerable) or for very particular inputs. If it's very unlikely to change, then documenting the stable effect gives the benefit to those who do need it. – Jon Hanna Sep 28 '15 at 11:00
  • I found a comment by Eric Lippert regarding collisions: "There are far, far greater problems than that source of bias. There are only four billion possible seeds to Random, which is far, far less than the number of possible shuffles of a moderately sized set. Only a tiny fraction of the possible shuffles can be generated. That bias dwarfs the bias due to accidental collisions." – mafu Sep 28 '15 at 11:11
  • However, I think this would be an easily mitigated or possibly even ignored problem in this very casual use case. – mafu Sep 28 '15 at 11:12
  • @JonHanna you looked at speeding it up? That's great news. Did you look at TimSort? Other platforms seem to love it. I have noticed that the BCL sorting algorithms seem to change with each major revision and become more sophisticated. – usr Sep 28 '15 at 11:36
  • @usr TimSort is great, but its more expensive in memory use which means it can be worse when that becomes a factor, and the impact of that is hard to predict (how many real-world cases would be better, would be worse, would suddenly have memory exceptions). I'm not ruling it out, but it would make it harder to sell a PR with that in it. I'm likely going to propose a local stack of indices being used rather than standard recursion as this gives a nice boost in initial tests without altering the basic algorithm. My main focus at the moment though is changing the order of … – Jon Hanna Sep 28 '15 at 11:50
  • @usr `OrderBy(…).First()` from O(n log n) to O(n), `OrderBy(…).Skip(j).Take(k)` from O(n log n) + O(j) + O(k) to O(n + k log k) + O(k), `OrderBy(…).ElementAt(…)` from O(n log n) to O(n), and so on as these are both common combinations and big potential wins. There can always of course be some negative impact that I've failed to see that will prevent acceptance of the PR. – Jon Hanna Sep 28 '15 at 11:54
  • @usr take a look at https://github.com/dotnet/corefx/pull/2401 if you are interested. – Jon Hanna Sep 28 '15 at 12:14
  • @JonHanna that would be a cool improvement. I corresponded with the JIT team a few months ago to see if we could get devirtualization and escape analysis in order to turn more simple LINQ operators into loops. I kind of doubt it will ever happen with RyuJIT. Maybe LLILC will finally be able to do that. These things are standard on the JVM. – usr Sep 28 '15 at 12:15
  • @JonHanna does sorting still have worst case quadratic behavior? That could lead to a vulnerability like the one where an attacker could provoke hash table collisions and overload the server. – usr Sep 28 '15 at 12:16
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/90790/discussion-between-jon-hanna-and-usr). – Jon Hanna Sep 28 '15 at 12:17

It's not guaranteed to continue working as well as it does.

It's highly unlikely that it would ever stop working as well as it does, because a change that broke it would require changing the pre-computation of the key values, which would be very expensive in many cases. It would also be an observable change in other ways, in terms of how often the selector is called. That sort of observable change isn't ruled out (I've had PRs accepted to .NET Core that reduced how often selectors are called in other LINQ methods), but the change would have to bring a very strong benefit, especially since this one would increase, rather than decrease, the number of calls.

It doesn't currently work very well though, because Enumerable.OrderBy (unlike other OrderBy methods in LINQ, such as those on Queryable or in PLINQ) is guaranteed to give a stable ordering (equal values always appear in their original order), which gives you a bias.

It is useful for quick-to-write random orderings where the quality of the shuffle need only be pretty good. For a more robust shuffle, use something like:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class Shuffler
{
  public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, int seed)
  {
    return new ShuffledEnumerable<T>(source, seed);
  }

  public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source)
  {
    return new ShuffledEnumerable<T>(source);
  }

  private class ShuffledEnumerable<T> : IEnumerable<T>
  {
    private readonly IEnumerable<T> _source;
    private readonly int? _seed;

    public ShuffledEnumerable(IEnumerable<T> source)
    {
      _source = source;
    }

    public ShuffledEnumerable(IEnumerable<T> source, int seed)
      : this(source)
    {
      _seed = seed;
    }

    public IEnumerator<T> GetEnumerator()
    {
      Random rnd = _seed.HasValue ? new Random(_seed.GetValueOrDefault()) : new Random();

      // Copy the source, then run a Fisher-Yates shuffle over the copy.
      T[] array = _source.ToArray();
      for (int i = array.Length - 1; i > 0; --i)
      {
        // Pick a swap target from the not-yet-fixed positions [0, i].
        int j = rnd.Next(0, i + 1);
        if (i != j)
        {
          T swapped = array[i];
          array[i] = array[j];
          array[j] = swapped;
        }
      }
      return ((IEnumerable<T>)array).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
      return GetEnumerator();
    }
  }
}
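
A hypothetical usage example:

var shuffled = Enumerable.Range(1, 10).Shuffle().ToList();     // new order on each enumeration
var repeatable = Enumerable.Range(1, 10).Shuffle(42).ToList(); // same seed, same order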

You can improve it further if you need cryptographic quality of randomness by replacing the use of Random with a suitable cryptographic PRNG.
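
For instance, here is a minimal sketch of the same swap loop using the framework's cryptographic generator (this assumes .NET Core 3.0 or later, where RandomNumberGenerator.GetInt32 is available):

using System.Security.Cryptography;

// The same Fisher-Yates loop, but each index is drawn from the OS-backed
// cryptographic generator instead of System.Random (no seeding possible).
for (int i = array.Length - 1; i > 0; --i)
{
  int j = RandomNumberGenerator.GetInt32(0, i + 1); // upper bound is exclusive
  if (i != j)
  {
    T swapped = array[i];
    array[i] = array[j];
    array[j] = swapped;
  }
}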

Jon Hanna
  • Good point about the need for precomputed keys. -- I disagree that stability is a problem, as even considering the birthday problem, `NextDouble` has too large a result space to suffer from collisions. – mafu Sep 28 '15 at 10:36

It's not guaranteed, although it works and most likely will not change. But if you wish to be 100% sure, you can guarantee it yourself by using an explicit projection:

enumerable.Select(item => new { Source = item, Order = R.NextDouble() }).OrderBy(x => x.Order).Select(x => x.Source)

which of course is much more verbose, but still easy for a quick and dirty approach.

Ivan Stoev
  • This is also known as [Schwartz sort](https://en.wikipedia.org/wiki/Schwartzian_transform). I find it a very elegant way to quickly (in terms of code) shuffle a small list. But I wonder why everybody here seems to use floating point. A random integer of the same bit size would be at least as fast if not faster and avoid all the floating-point headaches. – 5gon12eder Sep 28 '15 at 19:52
  • @5gon12eder Thank you for the info! I didn't know that (reinventing the wheel as usual :-) – Ivan Stoev Sep 28 '15 at 19:55