13

I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, and the ids should be unique and the order of records should be random (or pseudo-random).

I tried this:

# coding: utf-8
import random

COUNT = 100000000

random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()

But it's eating all of my memory.

Is there a way to generate a big shuffled sequence of consecutive (not necessarily but it would be nice, otherwise unique) integer numbers? Using a generator and not keeping all the sequence in RAM?

warvariuc
  • 57,116
  • 41
  • 173
  • 227

4 Answers4

9

If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).

As DSM pointed out, this can be done with the standard modules in an efficient way:

>>> import array
>>> a = array.array('I', xrange(10**8))  # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random                                                               
>>> random.shuffle(a)

It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:

>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32')  # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)

(this is only useful if your program already uses NumPy, as the standard module approach is about as efficient).


Both method take about the same amount of time on my machine (maybe 1 minute for the shuffling), but the 0.5 GB they use is not too big for current computers.

PS: There are too many elements for the shuffling to be really random because there are way too many permutations possible, compared to the period of the random generators used. In other words, there are fewer Python shuffles than the number of possible shuffles!

Eric O. Lebigot
  • 91,433
  • 48
  • 218
  • 260
  • 2
    Even without `numpy`, I think `a = array.array('I', xrange(10**8))` and `random.shuffle(a)` would achieve the same end. If the N is small enough, this is far and away the simplest route to the goal. – DSM Apr 27 '13 at 12:43
  • @DSM: Very good point, thanks! – Eric O. Lebigot Apr 27 '13 at 12:53
  • I accepted your answer - it helped to generate the needed data. Still it would be nice to see an answer with a generator. – warvariuc Apr 27 '13 at 13:04
  • Using a bitvector library, you can store a set of integers `0 <= x < 100 million` in about 12.5 megabytes - the presence or absence of each integer is indicated by a single bit. There's a bitvector library in they [Python Package Index](https://pypi.python.org/pypi/BitVector/) - I imagine it supports all the needed operations but haven't checked. Depending on how it's accessed, a bitvector will often be fast because it's compact, but it can occasionally be slow because the bit-fiddling overhead can sometimes outweigh the memory bandwidth advantage. –  Apr 27 '13 at 13:23
  • Generating a sequence of integers in pseudo-random order without storing them all somehow is very difficult. It's not too hard to generate a sequence of the right length without repetition - you just need a 1:1 mapping between 0..n in order and your pseudo-random order sequence. The trouble is deriving a suitable 1:1 mapping of the right size that's reasonably random and, therefore, not expressible using a simple calculation. –  Apr 27 '13 at 13:29
  • 2
    The mersenne twister `random` uses has a period of 2^19937 - 1 (according to Wikipedia). If I didn't mess up fiddling with Wolfram alpha (it refused to evaluate these directly, so I had to use Stirling's approximation), 100 million ! is roughly 2^(10^9.4), so your PS is correct. The period is still very impressive. –  Apr 27 '13 at 13:43
4

Maybe something like (won't be consecutive, but will be unique):

from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num) # etc...
Eric O. Lebigot
  • 91,433
  • 48
  • 218
  • 260
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • Looks like it was quite simple! Is there a way to repeat the sequence having a seed? – warvariuc Apr 27 '13 at 13:09
  • 1
    For the record, this doesn't guarantee uniqueness, although they *are* unique over the first 10^8 numbers. It's basically just taking really big random numbers and then observing that there are no collisions. – DSM Apr 27 '13 at 13:10
  • 1
    Here is a reference regarding the likeliness of collisions: http://en.wikipedia.org/wiki/Uuid#Random_UUID_probability_of_duplicates – Eric O. Lebigot Apr 27 '13 at 13:45
  • @EOL thanks for that link - very useful reference – Jon Clements Apr 27 '13 at 13:47
0

You can fetch random int easily from reading (on linux) /dev/urandom or using os.urandom() and struct.unpack():

Return a string of n random bytes suitable for cryptographic use.

This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.

>>> for i in range(4): print( hex( struct.unpack('<L', os.urandom(4))[0]))
... 
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6

While on the other hand random package:

However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.

If you really need unique records you should go with this or answer provided by EOL.

But assuming really random source, with possibly repeated characters you'll have 1/N (where N = 2 ** sizeof(int)*8 = 2 ** 32) chance of hitting item at first guess, thus you can get (2**32) ** length possible outputs.

On the other hand when using just unique results you'll have max:

product from i = 0 to length {2*32 - i} 
               = n! / (n-length)!
               = (2**32)! / (2**32-length)!

Where ! is factorial, not logical negation. So you'll just decrease randomness of result.

Community
  • 1
  • 1
Vyktor
  • 20,559
  • 6
  • 64
  • 96
0

This one will keep your memory OK but will probably kill your disk :)

It generates a file with the sequence of the numbers from 0 to 100000000 and then it randomly pick positions in it and writes to another file. The numbers have to be re-organized in the first file to "delete" the numbers that have been chosen already.

import random

COUNT = 100000000

# Feed the file
with open('file1','w') as f:
    i = 0
    while i <= COUNT:
        f.write("{0:08d}".format(i))
        i += 1

with open('file1','r+') as f1:
    i = COUNT
    with open('file2','w') as f2:
        while i >= 0:
            f1.seek(i*8)
            # Read the last val
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read random pos
            f1.seek(random_pos*8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos*8)
            f1.write(last_val)
            i -= 1
print "Done"
jujaro
  • 447
  • 2
  • 4
  • A detail: your `while` loops are normally written as `for … in xrange(…)` loops. – Eric O. Lebigot Apr 27 '13 at 13:52
  • Interesting algorithm for generating a permutation. It would be useful if you explained it with words too: this would make your answer much easier to read. Note that you generate `COUNT+1` numbers instead of `COUNT`. I would also like to note that it could obviously made a tad more efficient (factor 2 in disk usage) by using a binary representation instead of a text one. – Eric O. Lebigot Apr 27 '13 at 14:02
  • There is a similar but simpler method (with a single file) at http://stackoverflow.com/a/196065/42973, with explanations! – Eric O. Lebigot Apr 27 '13 at 14:11
  • Thanks for the comments @EOL, I will add some explanations. – jujaro Apr 27 '13 at 14:13