I was looking at a question about a bad implementation of the Fisher-Yates shuffling algorithm, and I was perplexed that getting it slightly wrong could introduce such a bias.

The two algorithms are these:

private Random _random = new Random();

public int[] FisherYates(int[] source)
{
    int[] output = source.ToArray();
    for (var i = 0; i < output.Length; i++)
    {
        // Correct: j is drawn uniformly from the remaining positions [i, Length).
        var j = _random.Next(i, output.Length);
        (output[i], output[j]) = (output[j], output[i]);
    }
    return output;
}

public int[] FisherYatesBad(int[] source)
{
    int[] output = source.ToArray();
    for (var i = 0; i < output.Length; i++)
    {
        // Bug: j is drawn from all positions [0, Length), not just [i, Length).
        var j = _random.Next(0, output.Length);
        (output[i], output[j]) = (output[j], output[i]);
    }
    return output;
}

A really subtle difference, but enough to cause a massive bias.
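
Here's a minimal counting sketch (my addition, plain C#; run it anywhere, e.g. LINQPad) that makes the bias concrete: for n = 3 the bad loop produces 3^3 = 27 equally likely swap sequences, but there are only 3! = 6 permutations, and 6 doesn't divide 27, so some permutations must turn up more often than others.

void CountBadShufflePermutations()
{
    var counts = new Dictionary<string, int>();
    const int n = 3;
    // Enumerate every possible swap sequence (j0, j1, j2) of the bad shuffle.
    for (var j0 = 0; j0 < n; j0++)
    for (var j1 = 0; j1 < n; j1++)
    for (var j2 = 0; j2 < n; j2++)
    {
        var output = new[] { 0, 1, 2 };
        var js = new[] { j0, j1, j2 };
        for (var i = 0; i < n; i++)
            (output[i], output[js[i]]) = (output[js[i]], output[i]);
        var key = string.Join("", output);
        counts[key] = counts.TryGetValue(key, out var c) ? c + 1 : 1;
    }
    // Prints counts of 4 or 5 out of 27 per permutation - never uniform.
    foreach (var pair in counts.OrderBy(p => p.Key))
        Console.WriteLine($"{pair.Key}: {pair.Value}/27");
}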

Good implementation:

[Plot: Good Fisher-Yates — average value at each position]

Bad implementation:

[Plot: Bad Fisher-Yates — average value at each position]

To be clear about these plots: I start with the numbers 0 to 99, create 10_000_000 shuffles using whichever algorithm, and then average the values at each position across all of the shuffles to get a single set of 100 figures. If the shuffle is truly random then all 100 figures should belong to the same normal distribution.

Now, that's all fine, but I thought I'd check to see if these methods produce valid results:

public int[] OrderByRandomNext(int[] source) => source.OrderBy(x => _random.Next()).ToArray();

public int[] OrderByRandomNextDouble(int[] source) => source.OrderBy(x => _random.NextDouble()).ToArray();

Both nice and neat, but are they a fair shuffle?

OrderByRandomNext:

[Plot: OrderByRandomNext — average value at each position]

OrderByRandomNextDouble:

[Plot: OrderByRandomNextDouble — average value at each position]

Notice that the figures for the first and the last positions are significantly lower in each?

Well, I thought it might be an artefact of how OrderBy works, so I tested it with another random number generator - one brought to us by Eric Lippert in his improving Random series.

public int[] OrderByBetterRandomNextDouble(int[] source) => source.OrderBy(x => BetterRandom.NextDouble()).ToArray();

public static class BetterRandom
{
    private static readonly ThreadLocal<RandomNumberGenerator> crng =
        new ThreadLocal<RandomNumberGenerator>(RandomNumberGenerator.Create);

    private static readonly ThreadLocal<byte[]> bytes =
        new ThreadLocal<byte[]>(() => new byte[sizeof(int)]);

    public static int NextInt()
    {
        crng.Value.GetBytes(bytes.Value);
        return BitConverter.ToInt32(bytes.Value, 0) & int.MaxValue;
    }

    public static double NextDouble()
    {
        while (true)
        {
            // Build a 52-bit random mantissa: 21 bits from one draw shifted
            // up, 31 bits from another, then divide by 2^52 to get [0, 1).
            long x = NextInt() & 0x001FFFFF;
            x <<= 31;
            x |= (long)NextInt();
            double n = x;
            const double d = 1L << 52;
            double q = n / d;
            if (q != 1.0)
                return q;
        }
    }
}

Well, here's the chart:

[Plot: BetterRandom — average value at each position]

There's no bias!

Here's my code to generate the data (run in LINQPad):

void Main()
{
    var n = 100;     // how many numbers to shuffle
    var s = 1000000; // shuffles per algorithm (the plots above used 10_000_000)

    var numbers = Enumerable.Range(0, n).ToArray();

    var algorithms = new Func<int[], int[]>[]
    {
        FisherYates,
        OrderByRandomNext,
        OrderByRandomNextDouble,
        OrderByBetterRandomNextDouble,
    };

    var averages =
        algorithms
            .Select(algorithm =>
                Enumerable
                    .Range(0, numbers.Length)
                    .Select(x =>
                        Enumerable
                            .Range(0, s)
                            .Select(y => algorithm(numbers))
                            .Aggregate(0.0, (a, v) => a + (double)v[x] / s))
                    .ToArray())
            .Select(x => new
            {
                averages = x,
                distribution = Accord.Statistics.Distributions.Univariate.NormalDistribution.Estimate(x.Skip(1).SkipLast(1).ToArray()),
                first = x.First(),
                last = x.Last(),
            })
            .Select(x => new
            {
                x.averages,
                x.distribution,
                x.first,
                x.last,
                first_prob = x.distribution.DistributionFunction(x.first),
                last_prob = x.distribution.DistributionFunction(x.last),
            })
            .ToArray();

    averages.Dump();
}

private Random _random = new Random();

public int[] FisherYates(int[] source)
{
    int[] output = source.ToArray();
    for (var i = 0; i < output.Length; i++)
    {
        var j = _random.Next(i, output.Length);
        (output[i], output[j]) = (output[j], output[i]);
    }
    return output;
}

public int[] OrderByRandomNext(int[] source) => source.OrderBy(x => _random.Next()).ToArray();

public int[] OrderByRandomNextDouble(int[] source) => source.OrderBy(x => _random.NextDouble()).ToArray();

public int[] OrderByBetterRandomNextDouble(int[] source) => source.OrderBy(x => BetterRandom.NextDouble()).ToArray();

public static class BetterRandom
{
    private static readonly ThreadLocal<RandomNumberGenerator> crng =
        new ThreadLocal<RandomNumberGenerator>(RandomNumberGenerator.Create);

    private static readonly ThreadLocal<byte[]> bytes =
        new ThreadLocal<byte[]>(() => new byte[sizeof(int)]);

    public static int NextInt()
    {
        crng.Value.GetBytes(bytes.Value);
        return BitConverter.ToInt32(bytes.Value, 0) & int.MaxValue;
    }

    public static double NextDouble()
    {
        while (true)
        {
            long x = NextInt() & 0x001FFFFF;
            x <<= 31;
            x |= (long)NextInt();
            double n = x;
            const double d = 1L << 52;
            double q = n / d;
            if (q != 1.0)
                return q;
        }
    }
}

Here's the data that I generated:

algorithm                     | distribution                                             | first              | last               | first_prob             | last_prob            
----------------------------- | -------------------------------------------------------- | ------------------ | ------------------ | ---------------------- | ---------------------
FisherYates                   | N(x; μ = 49.50267467345823, σ² = 0.0008896228453062147)  | 49.505465999987585 | 49.49833699998965  | 0.5372807100387846     | 0.44218570467529394  
OrderByRandomNext             | N(x; μ = 49.50503062243786, σ² = 0.0009954477334487531)  | 49.36330799998817  | 49.37124399998651  | 3.529550818615057E-06  | 1.115772521409486E-05
OrderByRandomNextDouble       | N(x; μ = 49.505720877539765, σ² = 0.0008257970106087029) | 49.37231699998847  | 49.386660999990106 | 1.7228855271333998E-06 | 1.712972513601141E-05
OrderByBetterRandomNextDouble | N(x; μ = 49.49994663264188, σ² = 0.0007518765247716318)  | 49.50191999998847  | 49.474235999989205 | 0.5286859991636343     | 0.17421285127499514  

Here's my question. What's up with System.Random and the bias it introduces?

Enigmativity
  • I don't understand how `FisherYatesBad` could produce that nicely ordered parabolic result. I've used that algorithm and it shuffled stuff as expected. – Andrew Jun 08 '21 at 13:54
  • I think it would be useful to describe what the plots are actually plotting, i.e. label axes etc. – JonasH Jun 08 '21 at 14:03
  • Which version of .NET are you using? The algorithm for `Random()` without a seed [was changed recently](https://github.com/dotnet/runtime/pull/47085) and might be applied to newer .NET versions. See also: https://github.com/dotnet/runtime/issues/23198 . – Peter O. Jun 08 '21 at 16:23
  • Is the problem that `first_prob` and `last_prob` are too small in the two middle rows? I've run this locally (with a smaller `s` as it takes a very long time) and I get values like 0.72&0.33 for `OrderByRandomNext` and .82&.68 for `OrderByBetterRandomNextDouble`, although they change across runs. – Andrew Jun 08 '21 at 20:45
  • @Andrew - I've provided two implementations of Fisher-Yates - one correct and one bad. The bad one does indeed produce what looks like a parabolic result. You see the parabolic effect clearly after generating 10 million samples. – Enigmativity Jun 08 '21 at 22:37
  • @JonasH - I took the numbers 0 to 99 and then shuffled them. That produces 100 numbers that, in theory, should have an equal chance of appearing in any position. The theoretical average is 49.5. I then compute the average value in each position and I've plotted those. – Enigmativity Jun 08 '21 at 22:40
  • @PeterO. - I've used .Net Framework 4.7.2 and .Net 5.0.6. Both have the same bias. – Enigmativity Jun 08 '21 at 22:44
  • @Andrew - You need to run 10 million or so as the bias doesn't show clearly before then. It's the averaging that removes the bias. The chances of the first and last values, when using `System.Random`, appearing in the normal distribution is represented by the `first_prob` and `last_prob` values. They are both exceedingly unlikely to belong to the standard distribution defined by the internal values of each array. – Enigmativity Jun 08 '21 at 22:46
  • Do you think the problem is actually in `Random` or the fact that you're not matching the expectations of `OrderBy`? – Damien_The_Unbeliever Jun 09 '21 at 10:57
  • @Damien_The_Unbeliever - I initially thought that `OrderBy` might have been the culprit, but that's why I also tried with `BetterRandom`. Since that removed the bias I figured that `OrderBy` behaves properly. – Enigmativity Jun 09 '21 at 11:05
  • Good question! As noted in the given answer, there are a lot of sources of bias in the default implementation of `Random`. A particularly vexing source of bias is that there are 2^226 possible shuffles but only 2^32 possible seeds, so the vast, vast majority of possible shuffles are never generated. – Eric Lippert Jun 09 '21 at 17:47

1 Answer


The default RNG in .NET up to (and including) .NET 5 has known bias and performance issues, most of them documented at https://github.com/dotnet/runtime/issues/23198:

  • A typo in the implementation of Donald E. Knuth's subtractive random number generator with unknown practical effects.
  • A different modulo (2^32-1 instead of a power of two) with unknown practical effects.
  • Next(0, int.MaxValue) has heavy bias.
  • NextDouble() only produces 2^31 possible values, where it could pick from approx. 2^62 distinct values (the quick check below makes this visible).
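
A quick way to see that last point empirically (a sketch added for illustration, not from the linked issue): with only about 2^31 distinct outputs, the birthday bound predicts roughly 1,000,000² / 2^32 ≈ 230 repeated values in a million draws, whereas a full 52-bit generator would essentially never repeat at that sample size.

// Count repeated NextDouble() values in a million draws. Passing a seed
// forces the legacy Knuth-based generator even on .NET 6+, where an
// unseeded Random would use xoshiro256** instead.
var random = new Random(12345);
var seen = new HashSet<double>();
var collisions = 0;
for (var i = 0; i < 1_000_000; i++)
{
    if (!seen.Add(random.NextDouble()))
        collisions++;
}
// Expect a few hundred repeats from ~2^31 possible values.
Console.WriteLine($"Repeated doubles: {collisions}");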

This is why .NET 6 implemented a better algorithm (xoshiro256**). You will get this better RNG when you instantiate a new Random() instance without a seed. This is described in https://github.com/dotnet/runtime/pull/47085. Unfortunately, it's not easy to replace the old RNG when a seed is provided since people might rely on the behavior of the current, biased RNG.
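
If you need an unbiased shuffle without depending on Random at all, one option (a sketch using System.Security.Cryptography.RandomNumberGenerator.GetInt32, available since .NET Core 3.0) is to drive the correct Fisher-Yates loop from the crypto RNG, which samples a range uniformly without modulo bias:

using System.Security.Cryptography;

// The correct Fisher-Yates loop from the question, but with the swap index
// drawn from the OS crypto RNG. GetInt32(from, to) samples [from, to)
// uniformly, so there is no modulo bias and no seed to worry about.
public static int[] CryptoFisherYates(int[] source)
{
    var output = source.ToArray();
    for (var i = 0; i < output.Length - 1; i++)
    {
        var j = RandomNumberGenerator.GetInt32(i, output.Length);
        (output[i], output[j]) = (output[j], output[i]);
    }
    return output;
}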

Even though xoshiro256** also has some documented flaws (and a rebuttal), I found it to work well for my purposes. I have copied the improved implementation from .NET 6 and use that.

Side-note: LINQ queries are lazily evaluated (a.k.a. "deferred execution"). If you use an RNG in the .OrderBy lambda, you might get confusing results if you iterate multiple times, because the order might change on every enumeration. Some sorting algorithms rely on elements not suddenly changing their relative order in order to work correctly, so returning inconsistent ordering values can break them. Sure, today's OrderBy implementation in LINQ-to-Objects works fine, but there's no documented guarantee that it has to work with "randomly"-changing values. A reasonable alternative is .OrderBy(e => HashCode.Combine(0x1337, e)).
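
If you do want to keep the random-key approach, a defensive variant (a sketch, reusing the question's _random field) is to draw one key per element and freeze the pairs before sorting, so re-enumerating the result can never invoke the RNG again:

// One random key per element, materialized up front: the sort sees stable
// keys, and iterating the result twice cannot re-run the RNG.
public int[] ShuffleByFrozenKeys(int[] source) =>
    source
        .Select(item => (item, key: _random.NextDouble()))
        .ToArray()                  // freeze the keys here
        .OrderBy(pair => pair.key)
        .Select(pair => pair.item)
        .ToArray();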

Simon
  • A good answer, but I disagree that the `HashCode.Combine` alternative is reasonable. Why do you think so? – Enigmativity Jun 09 '21 at 11:21
  • I would still like to know why the first and last numbers are affected like they are. I agree this is a smoking gun, but it's still not proof of the error. – Enigmativity Jun 09 '21 at 11:31
  • `HashCode.Combine` tries to distribute bits slightly, and will produce new values every program invocation. In my tests I didn't see biased averages. If you want higher certainty, you should of course use a cryptographic hash and a better seed. – Simon Jun 09 '21 at 11:34
  • If you want proof, you'd have to analyze the behavior of Knuth's RNG with different modulus and configuration. That's out of my league, sorry. – Simon Jun 09 '21 at 11:36
  • Fair enough. It's a bit too mad professor level for me too. – Enigmativity Jun 09 '21 at 11:43
  • For those still not using .Net 6 (because it's not yet released :P), one can use the [XoshiroPRNG.Net](https://www.nuget.org/packages/XoshiroPRNG.Net/) package. It provides a performant implementation of several PRNGs in the "Xoshiro family". I am the maintainer of that package, and if you need an algo I haven't coded, feel free to ask. – pepoluan Jun 09 '21 at 12:05
  • I think there might be some misunderstanding here about the practice of producing a shuffle by providing a random key. Though as the question notes, certain choices of RNG can produce bias, the fundamental technique of "associate a random key with each item and sort by that key" is fine in LINQ. The implementation generates each key exactly once and stores it; the key generating lambda is NOT called every time that the item is encountered during the sort. – Eric Lippert Jun 09 '21 at 17:34
  • What is NOT fine is the technique of sorting with a comparison where the comparison operation randomly says "bigger", "smaller" or "equal" for arbitrary pairs. A comparison operator is required to produce a total order: that is, A==B implies B==A, A<B implies B>A, and, most commonly violated, transitivity works: if A<B and B<C then A<C. – Eric Lippert Jun 09 '21 at 17:37
  • All that said, the more general point made in this answer is worth keeping in mind. Executing a query twice can unexpectedly produce different results, either because the underlying collection changed or because the lambdas in the query are impure. If your program logic relies upon queries producing the same results twice, then you need to do work to ensure that you are not in one of those situations. – Eric Lippert Jun 09 '21 at 17:41