You can use itertools.combinations together with the classic grouper recipe. That way you have a pure streaming iterator (no needless storage of long lists).
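For reference, I'm assuming the classic grouper recipe from the itertools docs (the zip_longest-based one), together with the imports used below:

from itertools import combinations, islice, zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks or blocks:
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

With that in place: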
# example
n = 4            # number of items
m = 2            # desired number of batches
batch_size = n * (n - 1) // (2 * m)   # 6 pairs split over 2 batches -> 3 per batch
[list(pairs) for pairs in grouper(combinations(range(n), 2), batch_size)]
# out:
[[(0, 1), (0, 2), (0, 3)], [(1, 2), (1, 3), (2, 3)]]
Another example:
batch_size = 8
n = 256          # the full list of pairs would be 32_640 long
first_k = 10     # only look at the first 10 batches
for batch in islice(grouper(combinations(range(n), 2), batch_size), first_k):
    print(list(batch))
Output:
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8)]
[(0, 9), (0, 10), (0, 11), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16)]
[(0, 17), (0, 18), (0, 19), (0, 20), (0, 21), (0, 22), (0, 23), (0, 24)]
[(0, 25), (0, 26), (0, 27), (0, 28), (0, 29), (0, 30), (0, 31), (0, 32)]
[(0, 33), (0, 34), (0, 35), (0, 36), (0, 37), (0, 38), (0, 39), (0, 40)]
[(0, 41), (0, 42), (0, 43), (0, 44), (0, 45), (0, 46), (0, 47), (0, 48)]
[(0, 49), (0, 50), (0, 51), (0, 52), (0, 53), (0, 54), (0, 55), (0, 56)]
[(0, 57), (0, 58), (0, 59), (0, 60), (0, 61), (0, 62), (0, 63), (0, 64)]
[(0, 65), (0, 66), (0, 67), (0, 68), (0, 69), (0, 70), (0, 71), (0, 72)]
[(0, 73), (0, 74), (0, 75), (0, 76), (0, 77), (0, 78), (0, 79), (0, 80)]
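Since everything is lazy, you can also skip ahead without materializing the earlier batches, e.g. to look at batches 10 through 12 only (islice with a start and a stop):

for batch in islice(grouper(combinations(range(n), 2), batch_size), 10, 13):
    # prints batches 10, 11 and 12: pairs (0, 81) through (0, 104)
    print(list(batch))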
Update: an example where there is a remainder (n * (n - 1) / 2 isn't a multiple of m):
n = 5
m = 3
batch_size = n * (n - 1) // (2 * m)
[list(pairs) for pairs in grouper(combinations(range(n), 2), batch_size)]
# out:
[[(0, 1), (0, 2), (0, 3)],
[(0, 4), (1, 2), (1, 3)],
[(1, 4), (2, 3), (2, 4)],
[(3, 4), None, None]]
If one prefers to drop these None elements (the default fillvalue of grouper()), and thus accept a smaller last batch:
[[p for p in pairs if p] for pairs in grouper(combinations(range(n), 2), batch_size)]
# out:
[[(0, 1), (0, 2), (0, 3)],
[(0, 4), (1, 2), (1, 3)],
[(1, 4), (2, 3), (2, 4)],
[(3, 4)]]
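As an aside, if you're on Python 3.12+, itertools.batched gives you this behavior directly, i.e. a shorter last batch and no fill values:

from itertools import batched  # Python 3.12+

[list(b) for b in batched(combinations(range(n), 2), batch_size)]
# same output as above, ending with the short [(3, 4)] batch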
Update 2: Longer initial batch desired.
It seems (without much stated motivation, but based on what the OP's divide() does) that they would prefer an initial batch that is longer than the others when there is a remainder (rather than None fillers or a shorter batch at the end).
Not a problem. Rather than changing the nice, canonical grouper, we can do our own math and split off the initial batch ourselves. Note that this doesn't impact memory: the whole setup remains a pure streaming iterator:
def gen_pair_batch(n, m):
    n_pairs = n * (n - 1) // 2
    batch_size = n_pairs // m
    # the first batch absorbs the remainder, so exactly m batches come out
    initial_batch = batch_size + n_pairs % m
    gen_pairs = combinations(range(n), 2)
    yield tuple(islice(gen_pairs, initial_batch))
    # the remaining pairs split evenly into (m - 1) batches of batch_size
    yield from grouper(gen_pairs, batch_size)
Example:
>>> list(gen_pair_batch(5, 3))
[((0, 1), (0, 2), (0, 3), (0, 4)),
((1, 2), (1, 3), (1, 4)),
((2, 3), (2, 4), (3, 4))]
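As a quick sanity check on the remainder handling, a case like 15 pairs split into 4 batches comes out as one longer batch followed by equal ones:

>>> [len(batch) for batch in gen_pair_batch(6, 4)]   # 15 pairs, 4 batches
[6, 3, 3, 3]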
Performance
Either of the versions above uses a very limited amount of memory (that's the beauty of streaming iterators). However, I doubt you'll be able to achieve your goal of a max time of 2 s for n=100_000:
import sys
import psutil

def go(n):
    n_pairs = n * (n - 1) // 2
    m = max(10, n_pairs // 1000)
    process = psutil.Process()
    mmin, mmax = sys.maxsize, 0
    count = 0
    for batch in gen_pair_batch(n, m):
        # ... do something with the batch of pairs
        mem = process.memory_info().rss
        if mem < mmin: mmin = mem
        if mem > mmax: mmax = mem
        count += 1
    return count, mmin, mmax
Test:
%%time
count, mmin, mmax = go(n)
print(f'finished go({n}) count = {count}; total resident mem was between: {mmin}, {mmax}')
For n = 1000:
finished go(1000) count = 499; total resident mem was between: 124968960, 124968960
CPU times: user 20.9 ms, sys: 94 µs, total: 21 ms
Wall time: 20.1 ms
For n = 10_000:
finished go(10000) count = 49995; total resident mem was between: 124968960, 124968960
CPU times: user 1.77 s, sys: 164 ms, total: 1.94 s
Wall time: 1.94 s
For n = 100_000:
finished go(100000) count = 4999950; total resident mem was between: 124968960, 124968960
CPU times: user 2min 57s, sys: 17 s, total: 3min 14s
Wall time: 3min 14s
So: no go on the time constraint, but the good news is that the memory usage is well behaved. As an aside, I am wondering why you would want to generate those pairs (up to n = 100K, and why in batches, etc.). Given more insight into the actual/larger goal, we could probably come up with something more efficient.
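For example, if the downstream code could consume index arrays rather than Python tuples, generating the pairs in vectorized blocks (a rough sketch with numpy, assuming that's an option) should be considerably faster than building millions of small tuples:

import numpy as np

def pair_blocks(n):
    # Yield the pairs (i, j) with j > i, one block per i,
    # as (n - 1 - i, 2) integer arrays instead of Python tuples.
    for i in range(n - 1):
        j = np.arange(i + 1, n)
        yield np.column_stack((np.full(j.size, i, dtype=j.dtype), j))

But whether something like that fits depends entirely on what the batches are for.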