
I am experimenting with unrolling a few nested loops for (potentially) better performance at the expense of memory. In my scenario, I would end up with a list of about 300M elements (tuples), which I'd have to yield in (more or less) random order.

At this order of magnitude, random.shuffle(some_list) really is not the way to go anymore.

The example below illustrates the issue. Be aware that on x86_64 Linux with CPython 3.6.4, it will eat about 11 GB of memory.

def get_random_element():
    # Materializing the full list is what eats the ~11 GB of memory.
    some_long_list = list(range(0, 300000000))
    for random_item in some_long_list:
        yield random_item

My thinking so far is to simply generate one random index per iteration and yield the randomly picked element from the list (indefinitely). That may yield certain elements multiple times and skip others entirely, which would be a trade-off worth considering.
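For illustration, a minimal sketch of that with-replacement idea (the function name is mine, not from the code above):

import random

def yield_with_replacement(data):
    # Sampling with replacement: repeats are possible and
    # some elements may never be yielded.
    while True:
        yield data[random.randrange(len(data))]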

What other options do I have within reasonable bounds of memory and CPU time to possibly yield every element of the list only once?

s-m-e

1 Answer


Here is Fisher-Yates-Knuth in-place sampling (https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)

Memory was stable at ~4 GB (yes, I was using a length of 100000000)

# Fisher-Yates-Knuth sampling, in-place Durstenfeld version

import numpy as np

def swap(data, posA, posB):
    if posA != posB:
        data[posB], data[posA] = data[posA], data[posB]

def get_random_element(data, datalen):
    pos = datalen

    while pos > 0:
        idx = np.random.randint(low=0, high=pos) # sample in the [0...pos) range

        pos -= 1
        swap(data, idx, pos)

        yield data[pos]


length = 100000000
some_long_list = list(range(0, length))

gen = get_random_element(some_long_list, length)

for k in range(0, length):
    print(next(gen))

UPDATE

For speed, you might want to inline swap() as well.
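A sketch of what that could look like (same logic as above, with swap() folded into the loop body; the _inlined suffix is mine):

import numpy as np

def get_random_element_inlined(data, datalen):
    pos = datalen
    while pos > 0:
        idx = np.random.randint(low=0, high=pos) # sample in the [0...pos) range
        pos -= 1
        # swap() folded into the loop, saving one function call per iteration
        if idx != pos:
            data[pos], data[idx] = data[idx], data[pos]
        yield data[pos]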

Severin Pappadeux
  • Thanks for this excellent answer. Yep, I am swapping inline - it saves one extra function call. I am also experimenting with putting my tuples (of integers) into a numpy array instead of a Python list ("slightly" more memory efficient), but then the above swapping strategy does not seem to work. But this is just a detail ;) – s-m-e Mar 12 '18 at 16:35
  • Yep, numpy does not support the above swap strategy: https://stackoverflow.com/q/14933577/1672565 – s-m-e Mar 12 '18 at 16:42
  • @s-m-e yes, you have to distinguish between views and copies. Other strategies to consider: 1. Bunching - return an array of samples per get_random_element() call (see the sketch below). 2. There are two major operations in get_random_element() - one is getting back the sampled value, and the other is swapping elements and adjusting the position. It might be worth splitting those (especially if doing bunched sampling) and running them in different threads. That might cost you another copy of the sampled values (or some lock, or even a lockless structure), but the swap on a just-returned bunch could run in parallel to the main processing loop. – Severin Pappadeux Mar 12 '18 at 17:38
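For illustration, a rough sketch of the bunching idea from the comment above (the function name, the bunch_size parameter, and the caller-managed pos are my own framing, not code from the thread):

import numpy as np

def get_random_bunch(data, pos, bunch_size):
    # Run several Fisher-Yates steps at once and return the samples
    # as a single numpy array; the caller keeps track of pos.
    bunch = np.empty(min(bunch_size, pos), dtype=np.int64)
    for k in range(len(bunch)):
        idx = np.random.randint(low=0, high=pos)
        pos -= 1
        if idx != pos:
            data[pos], data[idx] = data[idx], data[pos]
        bunch[k] = data[pos]
    return bunch, pos

# usage: pos shrinks until the list is exhausted
pos = length
while pos > 0:
    bunch, pos = get_random_bunch(some_long_list, pos, 1024)
    # ... process bunch here ...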