
I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1.

This would be the naive (and totally nonfunctional) way of doing it, just so it's clear what I want:

import random

numbers = list(range(2**32))  # materializing this list alone needs tens of gigabytes
random.seed(0)
random.shuffle(numbers)  # and shuffling 4 billion entries in pure Python is far too slow

I've tried making the list with numpy.arange() and shuffling it with PyCrypto's random.shuffle(). Making the list ate up about 8 GB of RAM, then shuffling raised that to around 25 GB. I only have 32 GB to give. But that doesn't matter, because...

I've tried cutting the list into 1024 slices and shuffling those, but even one of these slices takes far too long. I cut one of those slices into 128 still smaller slices, and each took about 620 ms. If the cost grew linearly, the whole job would take about 22.5 hours. That sounds acceptable, but the cost doesn't grow linearly.

Another thing I've tried is generating a random number for every entry and using it as the index of that entry's new location. I then go down the list and attempt to place each number at its new index; if that index is already in use, the index is incremented until a free one is found. This works in theory, and it gets about halfway through, but near the end it keeps having to search for free spots, wrapping around the list several times.

Is there any way to pull this off? Is this a feasible goal at all?

Daffy
    This looks like an XY problem. Please explain the need for such a shuffled list. – Klaus D. Jul 01 '17 at 05:59
  • How about using `generators` to get list items on the fly. This way you won't have to exhaust memory. Also do you want the entire list at once? – void Jul 01 '17 at 06:01
  • @KlausD. I'm experimenting with ideas in cryptography, and I'm wanting to make a "perfect" pseudorandom permutation. This requires having a lookup table of ~4 billion entries. – Daffy Jul 01 '17 at 06:02
  • @s_vishnu I don't have a method of getting items on the fly in this way. Or at least I haven't thought of one. Do you have any suggestions? – Daffy Jul 01 '17 at 06:03
  • have you tried writing this to a file and using shell script shuffle/random to write the content to another file ? – Vikash Singh Jul 01 '17 at 06:04
  • If you are talking about `billions` of entries. Then creating and saving them might consume a lot of time and memory. Create them as and when you need using `generators` in Python. Read about them here https://www.fullstackpython.com/generators.html – void Jul 01 '17 at 06:05
  • To create a cryptographically random permutation of 2^32 32bit integers you would need a random pool entropy of 2^36 bytes, most systems have a few kB. – Klaus D. Jul 01 '17 at 06:09
  • try `sort -R data.txt > output.txt` – Vikash Singh Jul 01 '17 at 06:12
  • Generate random numbers between 0 and 2**32. Use a [bitarray](https://pypi.python.org/pypi/bitarray/) to keep track of which numbers are already drawn. Keep generating numbers until you get an unused one and update the bitarray. The bitarray should take a bit more than 512 MB of memory (2^32 bits). – agtoever Jul 01 '17 at 06:22
  • You could generate a permutation of `2**32` integers by iterating through a complete period of a [linear congruential generator](https://en.wikipedia.org/wiki/Linear_congruential_generator) that has period `2**32` (see the sketch after these comments). Whether that is "random enough" depends on exactly how you are going to use the permutation. – Warren Weckesser Jul 01 '17 at 06:47
  • `crypto.random` takes roughly 2 ms to compute a random number. For 2**32 numbers, that's about 100 days... – B. M. Jul 01 '17 at 13:47
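
A minimal sketch of the full-period LCG idea from Warren Weckesser's comment (the constants are the classic Numerical Recipes parameters; by the Hull-Dobell theorem, any `a % 4 == 1` with odd `c` gives a full period mod 2**32, though LCG output is nowhere near cryptographic quality):

def lcg_permutation(seed=0, a=1664525, c=1013904223, m=2**32):
    # Full-period LCG: visits every value in [0, m) exactly once per period
    x = seed % m
    for _ in range(m):
        yield x
        x = (a * x + c) % m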

5 Answers


Computing all the values up front seems impossible, since Crypto computes a random integer in about a millisecond, so the whole job would take about 100 days.

Here is an implementation of the Knuth shuffle as a generator:

from Crypto.Random.random import randint
import numpy as np

def onthefly(n):
    # Knuth / Fisher-Yates shuffle, yielding each element as soon as it is fixed
    numbers = np.arange(n, dtype=np.uint32)
    for i in range(n):
        j = randint(i, n - 1)  # cryptographically strong index in [i, n-1]
        numbers[i], numbers[j] = numbers[j], numbers[i]
        yield numbers[i]

For n=10:

gen=onthefly(10)
print([next(gen) for i in range(9)])
print(next(gen))
#[9, 0, 2, 6, 4, 8, 7, 3, 1]
#5

For n=2**32, the generator takes a minute (and about 16 GB, for the uint32 array) to initialize, but each call is O(1).

B. M.
  • A slightly modified version of this is what I ended up going with. It's on track to complete in about 8 hours. Thanks! – Daffy Jul 01 '17 at 21:07

If you have a continuous range of numbers, you don't need to store them at all. It is easy to devise a bidirectional mapping between the value in a shuffled list and its position in that list. The idea is to use a pseudo-random permutation and this is exactly what block ciphers provide.

The trick is to find a block cipher that matches your requirement of 32-bit integers exactly. There are very few such block ciphers, but the Simon and Speck ciphers (released by the NSA) are parameterisable and support a block size of 32 bits (block sizes are usually much larger).

This library seems to provide an implementation of that. We can devise the following functions:

from speck import SpeckCipher  # assuming the linked library is installed and exposes this class

def get_value_from_index(key, i):
    # Encrypting a position yields the unique 32-bit value stored "at" that position
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.encrypt(i)

def get_index_from_value(key, val):
    # Decrypting a value recovers its position in the permutation
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.decrypt(val)

The library works with Python's big integers, so you might not even need to encode them.
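
A quick round-trip sanity check (a sketch; the key value is just the example used below):

key = 0x123456789ABCDEF0  # any fixed 64-bit key; keep it constant for determinism

val = get_value_from_index(key, 5)          # value at position 5 of the permutation
assert get_index_from_value(key, val) == 5  # decrypting maps back to the position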

A 64-bit key (for example 0x123456789ABCDEF0) is not much, but you could extend it with a construction similar to the one that turned DES into Triple DES. Keep in mind that keys should be chosen randomly, and they have to stay constant if you want determinism.
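
A minimal sketch of that construction (a hypothetical helper, mirroring Triple DES's encrypt-decrypt-encrypt chaining with three independent 64-bit keys):

def get_value_from_index_3key(keys, i):
    # EDE chaining of three 32-bit-block Speck instances triples the key material
    c1, c2, c3 = (SpeckCipher(k, mode='ECB', key_size=64, block_size=32) for k in keys)
    return c3.encrypt(c2.decrypt(c1.encrypt(i)))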


If you don't want to use an algorithm by the NSA for that, I would understand. There are others, but I can't find them right now. The Hasty Pudding cipher is even more flexible, but I don't know whether there is a Python implementation of it.

Artjom B.
  • This is so funny because the whole purpose of this is to make a "perfect" block cipher. – Daffy Jul 01 '17 at 20:29
  • Had you described the purpose, I wouldn't have answered. There you go, this is the [XY problem](https://meta.stackexchange.com/q/66377/266187) at work. – Artjom B. Jul 01 '17 at 20:47

The class I created uses a bitarray to keep track of which numbers have already been used. With the comments, I think the code is pretty self-explanatory.

import bitarray
import random


class UniqueRandom:
    def __init__(self):
        """ Init boolean array of used numbers and set all to False
        """
        self.used = bitarray.bitarray(2**32)
        self.used.setall(False)

    def draw(self):
        """ Draw a previously unused number
         Return False if no free numbers are left
        """

        # Check if there are numbers left to use; return False if none are left
        if self._free() == 0:
            return False

        # Draw a random index
        i = random.randint(0, 2**32-1)

        # Skip ahead from the random index to an undrawn number (linear probing)
        while self.used[i]:
            i = (i+1) % 2**32

        # Update used array
        self.used[i] = True

        # return the selected number
        return i

    def _free(self):
        """ Check how many places are unused
        """
        return self.used.count(False)


def main():
    r = UniqueRandom()
    for _ in range(20):
        print(r.draw())


if __name__ == '__main__':
    main()

Design considerations
While Garrigan Stafford's answer is perfectly fine, the memory footprint of this solution is much smaller: a bit more than 512 MB, since 2^32 bits pack into half a gigabyte. Another difference between our answers is that Garrigan's algorithm takes more time to generate a random number as the number of generated values grows (because it keeps drawing until it finds an unused number). This algorithm instead skips ahead from the drawn index to the next unused number, so the time per draw stays roughly the same until the pool of free numbers is nearly exhausted (though see the probing caveats in the comments below).
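
A minimal sketch of the quadratic-style probing suggested in the comments below (a hypothetical standalone helper, not part of the class above; it uses triangular-number offsets, which still visit every slot of a power-of-two-sized table):

def draw_probed(used, start, size=2**32):
    # Probe offsets 1, 3, 6, 10, ... (triangular numbers) instead of 1, 2, 3, ...
    # This avoids the primary clustering that slows linear probing at high load.
    i, step = start, 0
    while used[i]:
        step += 1
        i = (i + step) % size
    used[i] = True
    return i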

agtoever
  • Incrementing `i` while used[i] corresponds to linear probing: https://en.wikipedia.org/wiki/Hash_table#Open_addressing . Consider using quadratic probing instead. From https://en.wikipedia.org/wiki/Linear_probing : "...its performance degrades more quickly at high load factors because of primary clustering, a tendency for one collision to cause more nearby collisions." For example if a "large" list means a billion entries, clustering becomes a concern. – J_H Sep 23 '17 at 20:28
  • Actually, by the end you need to check N/2 or billions of times. Not O(1) at all. – Aleksandr Dubinsky Feb 11 '21 at 03:39

Here is a permutation RNG I wrote, which uses the fact that squaring a number mod a prime (plus some intricacies) gives a pseudorandom permutation.

https://github.com/pytorch/pytorch/blob/09b4f4f2ff88088306ecedf1bbe23d8aac2d3f75/torch/utils/data/_utils/index_utils.py

Short version:

from math import floor, sqrt

def _is_prime(n):
    if n == 2:
        return True
    if n == 1 or n % 2 == 0:
        return False

    for d in range(3, floor(sqrt(n)) + 1, 2):  # can use isqrt in Python 3.8
        if n % d == 0:
            return False

    return True


class Permutation:  # (the linked file derives this from a Range helper class, omitted here)
    """
    Generates a random permutation of integers from 0 up to size.
    Inspired by https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
    """

    size: int
    prime: int
    seed: int

    def __init__(self, size: int, seed: int):
        self.size = size
        self.prime = self._get_prime(size)
        self.seed = seed % self.prime

    def __getitem__(self, index):
        x = self._map(index)

        while x >= self.size:
            # If we map to a number greater than size, then the cycle of successive mappings must eventually result
            # in a number less than size. Proof: The cycle of successive mappings traces a path
            # that either always stays in the set n>=size or it enters and leaves it,
            # else the 1:1 mapping would be violated (two numbers would map to the same number).
            # Moreover, `set(range(size)) - set(map(n) for n in range(size) if map(n) < size)`
            # equals the `set(map(n) for n in range(size, prime) if map(n) < size)`
            # because the total mapping is exhaustive.
            # Which means we'll arrive at a number that wasn't mapped to by any other valid index.
            # This will take at most `prime-size` steps, and `prime-size` is on the order of log(size), so fast.
            # But usually we just need to remap once.
            x = self._map(x)

        return x

    @staticmethod
    def _get_prime(size):
        """
        Returns the smallest prime >= size of the form 4n-1 (i.e. congruent to 3 mod 4)
        """
        n = size + (3 - size % 4)
        while not _is_prime(n):
            # We expect to find a prime after O(log(size)) iterations
            # Using a brute-force primehood test, total complexity is O(log(size)*sqrt(size)), which is pretty good.
            n = n + 4
        return n

    def _map(self, index):
        a = self._permute_qpr(index)
        b = (a + self.seed) % self.prime
        c = self._permute_qpr(b)
        return c

    def _permute_qpr(self, x):
        # For a prime p ≡ 3 (mod 4), squaring sends the pair {x, p-x} to the same
        # residue; returning p - residue for the upper half restores a bijection.
        residue = pow(x, 2, self.prime)

        if x * 2 < self.prime:
            return residue
        else:
            return self.prime - residue
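
A quick check that it really is a permutation (small size, so it can be verified by eye):

perm = Permutation(size=10, seed=42)
values = [perm[i] for i in range(10)]
print(values)                             # a scrambled ordering of 0..9
assert sorted(values) == list(range(10))  # every value appears exactly once
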
Aleksandr Dubinsky
  • 22,436
  • 15
  • 82
  • 99

So one way is to keep track of which numbers you have already given out and keep handing out new random numbers one at a time. Consider:

import random
random.seed(0)

class RandomDeck:
    def __init__(self):
        self.usedNumbers = set()

    def draw(self):
        # Keep drawing until we hit a number we haven't handed out yet
        number = random.randint(0, 2**32 - 1)
        while number in self.usedNumbers:
            number = random.randint(0, 2**32 - 1)
        self.usedNumbers.add(number)  # remember it so it is never repeated
        return number

    def shuffle(self):
        # Forget the history; every number becomes available again
        self.usedNumbers = set()

As you can see, we essentially have a deck of random numbers between 0 and 2^32-1, but we only store the numbers we have already handed out, to ensure there are no repeats. You can then re-shuffle the deck by forgetting all the numbers already given out.

This should be efficient in most use cases as long as you don't draw ~1 million numbers without a reshuffle.
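
For example (the retry loop almost never fires at this scale: after a million draws, a fresh draw collides with probability about 10**6 / 2**32, roughly 0.02%):

deck = RandomDeck()
print([deck.draw() for _ in range(5)])  # five distinct pseudo-random 32-bit values
deck.shuffle()                          # forget the history and start a fresh deck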

Garrigan Stafford