You can use itertools.combinations together with the classic grouper recipe. That way you have a pure streaming iterator (no needless storage of long lists).
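For reference, I'm assuming the classic grouper recipe from the itertools docs (the zip_longest-based one), together with the imports used below:

from itertools import combinations, islice, zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks or blocks:
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

With that in place: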
# example
n = 4            # number of items
m = 2            # desired number of batches
batch_size = n * (n - 1) // (2 * m)   # 6 pairs split over 2 batches -> 3 per batch
[list(pairs) for pairs in grouper(combinations(range(n), 2), batch_size)]
# out:
[[(0, 1), (0, 2), (0, 3)], [(1, 2), (1, 3), (2, 3)]]
Another example:
batch_size = 8
n = 256          # the full list of pairs would be 32_640 long
first_k = 10     # only look at the first 10 batches
for batch in islice(grouper(combinations(range(n), 2), batch_size), first_k):
    print(list(batch))
Output:
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8)]
[(0, 9), (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16)]
[(0, 17), (0, 18), (0, 19), (0, 20), (0, 21), (0, 22), (0, 23), (0, 24)]
[(0, 25), (0, 26), (0, 27), (0, 28), (0, 29), (0, 30), (0, 31), (0, 32)]
[(0, 33), (0, 34), (0, 35), (0, 36), (0, 37), (0, 38), (0, 39), (0, 40)]
[(0, 41), (0, 42), (0, 43), (0, 44), (0, 45), (0, 46), (0, 47), (0, 48)]
[(0, 49), (0, 50), (0, 51), (0, 52), (0, 53), (0, 54), (0, 55), (0, 56)]
[(0, 57), (0, 58), (0, 59), (0, 60), (0, 61), (0, 62), (0, 63), (0, 64)]
[(0, 65), (0, 66), (0, 67), (0, 68), (0, 69), (0, 70), (0, 71), (0, 72)]
[(0, 73), (0, 74), (0, 75), (0, 76), (0, 77), (0, 78), (0, 79), (0, 80)]
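Since everything is lazy, you can also skip ahead without materializing the earlier batches, e.g. to look at batches 10 through 12 only (islice with a start and a stop):

for batch in islice(grouper(combinations(range(n), 2), batch_size), 10, 13):
    # prints batches 10, 11 and 12: pairs (0, 81) through (0, 104)
    print(list(batch))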
Update: an example where there is a remainder (n * (n - 1) / 2 isn't a multiple of m):
n = 5
m = 3
batch_size = n * (n - 1) // (2 * m)
[list(pairs) for pairs in grouper(combinations(range(n), 2), batch_size)]
# out:
[[(0, 1), (0, 2), (0, 3)],
[(0, 4), (1, 2), (1, 3)],
[(1, 4), (2, 3), (2, 4)],
[(3, 4), None, None]]
If one prefers to drop these None elements (the default fillvalue of grouper()), and thus accept a smaller last batch:
[[p for p in pairs if p] for pairs in grouper(combinations(range(n), 2), batch_size)]
# out:
[[(0, 1), (0, 2), (0, 3)],
[(0, 4), (1, 2), (1, 3)],
[(1, 4), (2, 3), (2, 4)],
[(3, 4)]]
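As an aside, if you're on Python 3.12+, itertools.batched gives you this behavior directly, i.e. a shorter last batch and no fill values:

from itertools import batched  # Python 3.12+

[list(b) for b in batched(combinations(range(n), 2), batch_size)]
# same output as above, ending with the short [(3, 4)] batch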
Update 2: Longer initial batch desired.
It seems (without much stated motivation, but based on what the OP's divide() does) that they would prefer an initial batch that is longer than the others when there is a remainder (rather than None fillers or a shorter batch at the end).
Not a problem. Rather than changing the nice, canonical grouper, we can do our own math and split off the initial batch ourselves. Note that this doesn't impact memory: the whole setup remains a pure streaming iterator:
def gen_pair_batch(n, m):
    n_pairs = n * (n - 1) // 2
    batch_size = n_pairs // m
    # the first batch absorbs the remainder, so exactly m batches come out
    initial_batch = batch_size + n_pairs % m
    gen_pairs = combinations(range(n), 2)
    yield tuple(islice(gen_pairs, initial_batch))
    # the remaining pairs split evenly into (m - 1) batches of batch_size
    yield from grouper(gen_pairs, batch_size)
Example:
>>> list(gen_pair_batch(5, 3))
[((0, 1), (0, 2), (0, 3), (0, 4)),
((1, 2), (1, 3), (1, 4)),
((2, 3), (2, 4), (3, 4))]
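As a quick sanity check on the remainder handling, a case like 15 pairs split into 4 batches comes out as one longer batch followed by equal ones:

>>> [len(batch) for batch in gen_pair_batch(6, 4)]   # 15 pairs, 4 batches
[6, 3, 3, 3]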
Performance
Either of the versions above uses a very limited amount of memory (that's the beauty of streaming iterators). However, I doubt you'll be able to achieve your goal of a max time of 2 s for n=100_000:
import sys
import psutil

def go(n):
    n_pairs = n * (n - 1) // 2
    m = max(10, n_pairs // 1000)
    process = psutil.Process()
    mmin, mmax = sys.maxsize, 0
    count = 0
    for batch in gen_pair_batch(n, m):
        # ... do something with the batch of pairs
        mem = process.memory_info().rss
        if mem < mmin: mmin = mem
        if mem > mmax: mmax = mem
        count += 1
    return count, mmin, mmax
Test:
%%time
count, mmin, mmax = go(n)
print(f'finished go({n}) count = {count}; total resident mem was between: {mmin}, {mmax}')
For n = 1000:
finished go(1000) count = 499; total resident mem was between: 124968960, 124968960
CPU times: user 20.9 ms, sys: 94 µs, total: 21 ms
Wall time: 20.1 ms
For n = 10_000:
finished go(10000) count = 49995; total resident mem was between: 124968960, 124968960
CPU times: user 1.77 s, sys: 164 ms, total: 1.94 s
Wall time: 1.94 s
For n = 100_000:
finished go(100000) count = 4999950; total resident mem was between: 124968960, 124968960
CPU times: user 2min 57s, sys: 17 s, total: 3min 14s
Wall time: 3min 14s
So: no go on the time constraint, but the good news is that the memory usage is well behaved. As an aside, I am wondering why you would want to generate those pairs (up to n = 100K, and why in batches, etc.). Given more insight into the actual/larger goal, we could probably come up with something more efficient.
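For example, if the downstream code could consume index arrays rather than Python tuples, generating the pairs in vectorized blocks (a rough sketch with numpy, assuming that's an option) should be considerably faster than building millions of small tuples:

import numpy as np

def pair_blocks(n):
    # Yield the pairs (i, j) with j > i, one block per i,
    # as (n - 1 - i, 2) integer arrays instead of Python tuples.
    for i in range(n - 1):
        j = np.arange(i + 1, n)
        yield np.column_stack((np.full(j.size, i, dtype=j.dtype), j))

But whether something like that fits depends entirely on what the batches are for.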