A memory efficient way for a randomized single pass over a set of indices

Question

I have a big file (about 1GB) which I am using as a basis to do some data integrity testing. I'm using Python 2.7 for this because I don't care so much about how fast the writes happen, my window for data corruption should be big enough (and it's easier to submit a Python script to the machine I'm using for testing)

To do this I'm writing a sequence of 32 bit integers to memory as a background process while other code is running, like the following:

from struct import pack
with open('./FILE', 'rb+', buffering=0) as f:
    f.seek(0)
    counter = 1
    while counter < SIZE+1:
        f.write(pack('>i', counter))
        counter+=1

Then after I do some other stuff it's very easy to see if we missed a write since there will be a gap instead of the sequential increasing sequence. This works well enough. My problem is some data corruption cases might only be caught with random I/O (not sequential like this) based on how we track changes to files

So what I need is a method for performing a single pass of random I/O over my 1GB file, but I can't really store this in memory since 1GB ~= 250 million 4-byte integers. Considered chunking up the file into smaller pieces and indexing those, maybe 500 KB or something, but if there is a way to write a generator that can do the same job that would be awesome. Like this:

from struct import pack

def rand_index_generator:
    generator = RAND_INDEX(1, MAX+1, NO REPLACEMENT)
    counter = 0
    while counter < MAX:
        counter+=1
        yield generator.next_index()

with open('./FILE', 'rb+', buffering=0) as f:
    counter = 1
    for index in rand_index_generator:
        f.seek(4*index)
        f.write(pack('>i', counter))
        counter+=1

I need it:

Not to run out of memory (so no pouring the random sequence into a list)
To be reproducible so I can verify these values in the same order later

Is there a way to do this in Python 2.7?

That’s pseudocode, just giving an example of functionality I was thinking of. random.choices will build a list which doesn’t work because I’ll run out of memory — Henry Dikeman, Apr 16 '22 at 20:13
Are you asking this? https://stackoverflow.com/questions/16165128/lazy-shuffle-algorithms — mkrieger1, Apr 16 '22 at 20:15
One approach is a [linear congruential generator](https://en.wikipedia.org/wiki/Linear_congruential_generator) (LCG) that has a period of 256 million. The period of an LCG is usually a power of 2, so the file would need exactly 2^28 integers. The sequence of the LCG is repeatable. Start it with any seed when writing, and then restart it with the same seed to verify. For an LCG with a period of 2^32, a common implementation is `seed = seed * 0xdeece66d + 0xb;` You can try `seed = seed * 0xeece66d + 0xb` and mask the upper 4 bits. Check that it generates all 2^28 values with no repeats. — user3386109, Apr 16 '22 at 20:54
https://stackoverflow.com/questions/51412182/iteratively-generating-a-permutation-of-natural-numbers/51414162#51414162 — Matt Timmermans, Apr 16 '22 at 21:21
@user3386109: if an LCG has a complete cycle mod 2\**n, it also has a complete cycle mod 2\**k for any k<=n. In particular, the last bit alternates between 0 and 1. That's precisely why using the low-order bits of an LCG-generated random sequence is a really bad idea. — rici, Apr 17 '22 at 00:04
@mkrieger that had one approach that was suitable for me, iteratively generating indices with coprime numbers. Easy for me since my indices are [0,2^n] so I just need a big odd number and don’t care about true pseudorandomness too much. I think the other methods people mentioned are a bit too heavyweight for my purpose. Thank you :) — Henry Dikeman, Apr 17 '22 at 21:11

score 0 · Answer 1 · answered Apr 18 '22 at 17:56

Just to provide an answer for anyone who has the same problem, the approach that I settled on was this, which worked well enough if you don't need something all that random:

def rand_index_generator(a,b):
    ctr=0
    while True:
        yield (ctr%b)
        ctr+=a

Then, initialize it with your index size, b and a value a which is coprime to b. This is easy to choose if b is a power of two, since a just needs to be an odd number to make sure it isn't divisible by 2. It's a hard requirement for the two values to be coprime, so you might have to do more work if your index size b is not such an easily factored number as a power of 2.

index_gen = rand_index_generator(1934919251, 2**28)

Then each time you want the new index you use index_gen.next() and this is guaranteed to iterate over numbers between [0,2^28-1] in a semi-randomish manner depending on your choice of 'a'

There's really no point in picking an a value larger than your index size, since the mod gets rid of the remainder anyways. This isn't a very good approach in terms of randomness, but it's very efficient in terms of memory and speed which is what I care about for simulating this write workload.

A memory efficient way for a randomized single pass over a set of indices

1 Answers1