
Right now, I am using the following code to create a Shuffle extension:

public static class SiteItemExtensions
{
    // In-place Fisher-Yates shuffle: walk the list backwards, swapping
    // each element with a randomly chosen one at or below its position.
    public static void Shuffle<T>(this IList<T> list)
    {
        var rng = new Random();
        int n = list.Count;
        while (n > 1)
        {
            n--;
            int k = rng.Next(n + 1); // 0 <= k <= n
            T value = list[k];       // swap list[k] and list[n]
            list[k] = list[n];
            list[n] = value;
        }
    }
}

I am looking for a faster, more efficient way to do this. Right now, using the Stopwatch class, it takes about 20 seconds to shuffle 100,000,000 items. Does anyone have any ideas to make this faster?
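
For reference, a minimal timing harness along the lines the question describes (the Stopwatch usage and the 100,000,000-item count come from the question; the scaffolding around them is my assumption, and it calls the Shuffle extension defined above):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class ShuffleBenchmark
{
    static void Main()
    {
        // Build a 100,000,000-element list, as described in the question.
        var list = new List<int>(100000000);
        for (int i = 0; i < 100000000; i++)
            list.Add(i);

        var sw = Stopwatch.StartNew();
        list.Shuffle(); // the IList<T> extension method above
        sw.Stop();
        Console.WriteLine("Shuffled in {0:F1} s", sw.Elapsed.TotalSeconds);
    }
}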

Icemanind
  • Looks optimal to me, if you need a uniform distribution. –  Sep 30 '11 at 02:10
  • Why are you shuffling 100 million items? Can you cut that number down somehow? – dlev Sep 30 '11 at 02:11
  • How "truly close to random" does the shuffling have to be? – vcsjones Sep 30 '11 at 02:14
  • @dlev -- I am only shuffling 100 million to test the efficiency of the algorithm. It's not a "real" list. – Icemanind Sep 30 '11 at 02:20
  • @icemanind Ah, in that case, have you actually found shuffling to be the bottleneck in your code? For small lists of items, your current approach looks fine. – dlev Sep 30 '11 at 02:34
  • @bdares: The algorithm given here does *not* produce a uniform distribution. A simple counting argument demonstrates that. There are only 2^32 possible sequences chosen by this algorithm because Random takes a 32 bit seed. But there are more than 2^32 possible shuffles when n exceeds 32. Therefore there are some shuffles that will not be produced by this algorithm, and therefore the distribution is not uniform. – Eric Lippert Sep 30 '11 at 03:14
  • No, this is a Fisher-Yates shuffle: http://en.wikipedia.org/wiki/Fisher-Yates_shuffle#The_modern_algorithm – Hans Passant Sep 30 '11 at 03:32
  • @HansPassant I think Eric Lippert is right... I assumed that Random would generate a reasonably random sequence, but for our purposes it does not (unless I'm wrong and the internal state can hold more bits than the seed seems to indicate). –  Sep 30 '11 at 03:55
  • Eric must have been talking about the algorithm in this example: http://www.codinghorror.com/blog/2007/12/the-danger-of-naivete.html – Hans Passant Sep 30 '11 at 04:01
  • @HansPassant: No, I'm talking about the algorithm presented here. The Knuth shuffle does not give an unbiased shuffle *if the source of randomness has insufficient entropy*. Since the source of randomness here has at best 32 bits of entropy, it cannot possibly give an unbiased shuffle of more than 32 items, since 32! > 2^32 – Eric Lippert Sep 30 '11 at 04:32
  • Calling `new Random()` on each call is not good – paparazzo Jul 13 '17 at 10:41
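
Taken together, the comments point at two quality fixes: reuse a single Random instance instead of constructing one per call, and, if every permutation of a large list must be reachable, draw indices from a cryptographic RNG rather than a 32-bit-seeded Random. A minimal sketch of the latter, using RNGCryptoServiceProvider with rejection sampling to avoid modulo bias (my illustration, not code from the thread):

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public static class CryptoShuffleExtensions
{
    private static readonly RNGCryptoServiceProvider Rng = new RNGCryptoServiceProvider();

    // Fisher-Yates shuffle driven by a cryptographic RNG, so the number of
    // reachable permutations is not capped by Random's 32-bit seed.
    public static void CryptoShuffle<T>(this IList<T> list)
    {
        var buffer = new byte[4];
        for (int n = list.Count - 1; n > 0; n--)
        {
            int k = NextInt(n + 1, buffer); // uniform in [0, n]
            T value = list[k];
            list[k] = list[n];
            list[n] = value;
        }
    }

    // Uniform integer in [0, bound), rejecting values that would bias the modulo.
    private static int NextInt(int bound, byte[] buffer)
    {
        uint limit = uint.MaxValue - uint.MaxValue % (uint)bound;
        uint value;
        do
        {
            Rng.GetBytes(buffer);
            value = BitConverter.ToUInt32(buffer, 0);
        } while (value >= limit);
        return (int)(value % (uint)bound);
    }
}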

2 Answers


This highlights an aspect of modern computer design that's often overlooked. It can be made over 3 times faster by a silly change:

            int k = 0; rng.Next(n + 1);  // silly change: k is always 0, so the swap reuses the
                                         // same cached slot; the random value is discarded

There are now more statements in the inner loop, but it is much faster. What you are seeing is the effect of the CPU cache. This algorithm has very poor cache locality: the odds that the next element to be read from the array is already in the cache are very low, so each access takes an expensive trip to the slower outer caches and the dead-slow memory bus. The odds that elements of the array needed later get loaded into the cache along the way are very high, but the odds that they are still there when they are finally needed are very low; your list is simply too large to fit in the cache.

Nothing can be done to fix this; it is inherent in the algorithm's design. Using realistic list sizes is, however, an obvious solution: running the shuffle 100,000 times on a list with 1,000 elements is three times as fast.
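
A sketch of the comparison described above (the sizes and repeat count come from the answer; the harness itself is my assumption, reusing the question's Shuffle extension):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class CacheEffectDemo
{
    static void Main()
    {
        // One shuffle of 100,000,000 items: almost every access misses the cache.
        var big = new List<int>(100000000);
        for (int i = 0; i < 100000000; i++) big.Add(i);
        var sw = Stopwatch.StartNew();
        big.Shuffle();
        Console.WriteLine("1 x 100,000,000: {0:F1} s", sw.Elapsed.TotalSeconds);

        // 100,000 shuffles of 1,000 items: the same total number of swaps,
        // but the whole list stays resident in the cache.
        var small = new List<int>(1000);
        for (int i = 0; i < 1000; i++) small.Add(i);
        sw = Stopwatch.StartNew();
        for (int rep = 0; rep < 100000; rep++) small.Shuffle();
        Console.WriteLine("100,000 x 1,000: {0:F1} s", sw.Elapsed.TotalSeconds);
    }
}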

Hans Passant
  • Is `int k = 0; rng.Next(n + 1)` what you really intended there? – Paul Walls Sep 30 '11 at 03:08
  • So does the speed increase occur because the store instruction is using a cached value of 0 each time vs. a failed lookup in `rng.Next(n + 1)`? (I realize this has nothing to do with the OP's question, but I am really fascinated by this result.) – Paul Walls Sep 30 '11 at 03:32
  • It occurs because it now keeps using the same memory address instead of the random one. So the value is guaranteed to be in the fast cache. – Hans Passant Sep 30 '11 at 03:36
  • Ah... hence your comment about poor cache locality due to the size of the list. Thanks for explaining. – Paul Walls Sep 30 '11 at 03:40
  • Well, no, the poor cache locality of the algorithm is a given. The list size makes it deadly to perf. The silly version is just as fast as the regular version on a small list that fits the cache. – Hans Passant Sep 30 '11 at 03:49
  • @HansPassant -- I find this very interesting. I would like to keep this question open just a little longer to see if there are any more responses, but so far, I like your answer. – Icemanind Sep 30 '11 at 14:59

You have exceeded the capabilities of your CPU's cache and are spending most of the time waiting for RAM.

By progressively lowering the number of elements, I get the following results (on List<int>):

count      time (s)     slowdown (vs. 10x fewer elements)
100000000  16.0429005   11.99215276436421
10000000   01.3377832   20.37312930505406
1000000    00.0656641   13.36837069158574
100000     00.0049119

Notice the big slowdown between 10^6 and 10^7 elements: I increased the number of elements by a factor of 10, but the time increased by a factor of 20. This is probably the point where my CPU can no longer fit most of the array into the 2nd (and, on my CPU, last) level of cache.

BTW, you can gain a second or two (but lose generality) by replacing IList<T> with List<T> in the method signature, which avoids the interface-call penalty on the [] indexer:

IList<T>:   16.0429005 s
List<T>:    14.3529349 s
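
For concreteness, a sketch of that variant (the question's method with only the parameter type changed, shown in its own static class for self-containment; if both versions are in scope, the compiler prefers this more specific overload for List<T> receivers):

using System;
using System.Collections.Generic;

public static class ListShuffleExtensions
{
    // Same Fisher-Yates loop, but taking List<T> directly so the
    // indexer accesses avoid interface dispatch through IList<T>.
    public static void Shuffle<T>(this List<T> list)
    {
        var rng = new Random();
        int n = list.Count;
        while (n > 1)
        {
            n--;
            int k = rng.Next(n + 1);
            T value = list[k];
            list[k] = list[n];
            list[n] = value;
        }
    }
}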

For the record, under Visual C++ 2010, std::random_shuffle on std::vector<int> with 100000000 elements takes...

17.947 s

...so you are probably already as fast as you can be.

(NOTE: Both C# and C++ benchmarks were done under respective Release configurations, outside of the debugger.)

Branko Dimitrijevic