34

I have a list which I shuffle with the Python built in shuffle function (random.shuffle)

However, the Python reference states:

Note that for even rather small len(x), the total number of permutations of x is larger than the period of most random number generators; this implies that most permutations of a long sequence can never be generated.

Now, I wonder what this "rather small len(x)" means. 100, 1000, 10000,...

Henrik
  • 14,202
  • 10
  • 68
  • 91

4 Answers4

68

TL;DR: It "breaks" on lists with over 2080 elements, but don't worry too much :)

Complete answer:

First of all, notice that "shuffling" a list can be understood (conceptually) as generating all possible permutations of the elements of the lists, and picking one of these permutations at random.

Then, you must remember that all self-contained computerised random number generators are actually "pseudo" random. That is, they are not actually random, but rely on a series of factors to try and generate a number that is hard to be guessed in advanced, or purposefully reproduced. Among these factors is usually the previous generated number. So, in practice, if you use a random generator continuously a certain number of times, you'll eventually start getting the same sequence all over again (this is the "period" that the documentation refers to).

Finally, the docstring on Lib/random.py (the random module) says that "The period [of the random number generator] is 2**19937-1."

So, given all that, if your list is such that there are 2**19937 or more permutations, some of these will never be obtained by shuffling the list. You'd (again, conceptually) generate all permutations of the list, then generate a random number x, and pick the xth permutation. Next time, you generate another random number y, and pick the yth permutation. And so on. But, since there are more permutations than you'll get random numbers (because, at most after 2**19937-1 generated numbers, you'll start getting the same ones again), you'll start picking the same permutations again.

So, you see, it's not exactly a matter of how long your list is (though that does enter into the equation). Also, 2**19937-1 is quite a long number. But, still, depending on your shuffling needs, you should bear all that in mind. On a simplistic case (and with a quick calculation), for a list without repeated elements, 2081 elements would yield 2081! permutations, which is more than 2**19937.

Henrik
  • 14,202
  • 10
  • 68
  • 91
rbp
  • 43,594
  • 3
  • 38
  • 31
  • 2
    +1 for nicely explaining the topic and problem. Imho this should be the accepted answer. Oh, and I'd move the TD;DR to the top since most people getting scared by a body of text probably won't read that far down :-). – Joey Jun 17 '10 at 15:21
  • @Johannes: you needen't have deleted your answer :) Still, thanks! – rbp Jun 18 '10 at 12:39
  • @rdp: Well, it was kinda redundant now :-). You did a much better job at explaining it. – Joey Jun 18 '10 at 12:50
  • 1
    Please note there are somewhat less than 100! atoms in the universe. The approximate number IIRC is only 10^^70th. So, yeah. Don't worry about EXACTLY how random it is. – Kevin J. Rice Nov 20 '15 at 02:36
  • One may be tempted to switch to random.SystemRandom aka secrect.SystemRandom. However, if no new entropy arrives in /dev/urandom during the shuffling, it has an even shorter period according to the accepted answer of http://stackoverflow.com/questions/32139660/is-dev-urandom-suitable-for-simulation-purpose – Joachim Wagner May 19 '17 at 11:21
  • Let's say you have a list of 2081 elements and still want to get all permutations.. How would you achieve that? Could you use multiple independent `Random` instances and shuffle repeatedly? – Joschua Nov 30 '19 at 21:23
21

I wrote that comment in the Python source originally, so maybe I can clarify ;-)

When the comment was introduced, Python's Wichmann-Hill generator had a much shorter period, and we couldn't even generate all the permutations of a deck of cards.

The period is astronomically larger now, and 2080 is correct for the current upper bound. The docs could be beefed up to say more about that - but they'd get awfully tedious.

There's a very simple explanation: A PRNG of period P has P possible starting states. The starting state wholly determines the permutation produced. Therefore a PRNG of period P cannot generate more than P distinct permutations (and that's an absolute upper bound - it may not be achieved). That's why comparing N! to P is the correct computation here. And, indeed:

>>> math.factorial(2080) > 2**19937 - 1
False
>>> math.factorial(2081) > 2**19937 - 1
True
Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • 2
    Thanks for the details. I think that the documentation for random.shuffle is currently a little too sparse. – Wolf Sep 07 '14 at 10:17
4

What they mean is that permutations on n objects (noted n!) grows absurdly high very fast.

Basically n! = n x n-1 x ... x 1; for example, 5! = 5 x 4 x 3 x 2 x 1 = 120 which means there are 120 possible ways of shuffling a 5-items list.

On the same Python page documentation they give 2^19937-1 as the period, which is 4.something × 10^6001 or something. Based on the Wikipedia page on factorials, I guess 2000! should be around that. (Sorry, I didn't find the exact figure.)

So basically there are so many possible permutations the shuffle will take from that there's probably no real reason to worry about those it won't.

But if it really is an issue (pesky customer asking for a guarantee of randomness perhaps?), you could also offload the task to some third-party; see http://www.random.org/ for example.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Joubarc
  • 1,206
  • 10
  • 17
  • 1
    Or 2081 as Johannes says. Guess I wasn't that far off then. – Joubarc Jun 17 '10 at 15:10
  • I was narrowing it down manually in Wolfram|Alpha since it wouldn't give me just a result for "x! > 2^19937-1". – Joey Jun 17 '10 at 15:12
  • 1
    I arrived at that with a quick loop testing for "math.factorial(i) >= 2**19937" :) – rbp Jun 17 '10 at 15:19
  • @rbp: I should really start giving my favorite scripting environment (PowerShell) some better math capabilities :-) – Joey Jun 17 '10 at 15:22
  • 2
    Or give it Python bindings, and use Python's stdlib! ;) – rbp Jun 17 '10 at 15:31
0

I want to answer this question because I faced the Mersenne Twister limitation. Let me explain. I had a list of 7200 items with a lot of repetitions that looks like this :

array = [0 for i in range(5920)] + [1 for i in range(1200)] + [2 for i in range(80)]

If you do the maths, it should be 7200! / (5920! * 1200! * 80!). However, it is pretty hard to compute and compare to 2080! because both numerator and denominator are huge. Anyway, with a bit of simplifications, one can come to the conclusion that IT IS bigger than 2080!. In addition, both random.shuffle() and random.sample() suffer the same limitation.

Here is a solution I came up with. I'd appreciate any criticism about this method because I'm not really good with statistics :

# Python 3.11.4
from random import Random
class Shuffle :
    def __init__(self, array : list[int]) -> None :
        self._array : list[int] = array
        self._seed : Random = Random()
        self.length : int = len(array)

    def choice(self) -> float :
        if self.length > 0 :
            pick : int = self._seed.choice(self._array)
            self._array.remove(pick)
            self.length -= 1
            return pick
        raise IndexError("No more item in array !")

This way, the array is shuffled with only length random generated number vs the crazy amount of permutations I was talking about.

To make things unambiguous, this class is meant to be used the following way :

array = [0 for i in range(5920)] + [1 for i in range(1200)] + [2 for i in range(80)]
shuffler = Shuffle(array)
shuffled_array = [shuffler.choice() for i in range(shuffler.length)]
del shuffler

So the IndexError would never be raised even if you call it several times, unless you iterate more than length.

I need such algorithm because random.choice() alone doesn't guarantee I still have the right amount of 0, 1 and 2 in the end.

I hope this could help people who try to bypass the Mersenne Twister RNG implementation for shuffling even though the initial question doesn't go that far.

Pehnny
  • 53
  • 5