
I have two sorted iterables, e.g.:

a = [C, B, A]
b = [3, 2, 1]

I want to generate a list of all possible pairs, combining the largest (lowest-index) pairs first. Largest means all combinations of index < 1 first (for both iterables), then all combinations of index < 2, etc. Desired result:

combine(a, b)
>> ((C, 3), (B, 3), (C, 2), (B, 2), (A, 3), (A, 2), (C, 1), (B, 1), (A, 1))

I looked at itertools.product(), but this generates the pairs in the wrong order.

Since the inputs are iterables (not necessarily lists), I'd prefer a library function that can handle iterables. The iterables may be long (millions of elements), so it would be desirable to avoid storing them all in memory. For the same reason, generating in the wrong order and then sorting is undesirable.

The output should be a generator, so that it's not necessary to iterate over all the combinations if not all are required.

Stefan
  • "I looked at itertools.product(), but this only generates the cartesian product." The desired output that you show contains every element that the cartesian product contains, and no others. The only apparent problem is the order - and this isn't well specified. I can't understand what you mean by "largest pairs". Why is `(B, 3)` "larger" than `(C, 2)`, but `(B, 2)` is "larger" than `(A, 3)`? What is the actual rule? (Also, are `A`, `B`, `C` supposed to be strings? Or are they other variables in your program? In that case, how should we know how to sort them without knowing the values?) – Karl Knechtel May 16 '22 at 22:36
  • "The iterables may be long (millions of elements)" Make sure you have a good idea of how many pairs will be generated, and how long the code will have to spend iterating over them - even if it doesn't have to store them all. – Karl Knechtel May 16 '22 at 22:37
  • If you don't want to generate the whole Cartesian product then sort, a possible trick is to use a priority queue (aka a heap: import heapq). If `C` is the largest element from the first iterable and `3` is the largest from the second iterable, then obviously the largest pair will be `(C, 3)`. Then the second-largest pair will be either `(B, 3)` or `(C, 2)`; add them both to the priority queue. – Stef May 16 '22 at 22:40
  • Does this answer your question? [Generate first N combinations of length L from weighted set in order of weight](https://stackoverflow.com/questions/72120390/generate-first-n-combinations-of-length-l-from-weighted-set-in-order-of-weight) – Stef May 16 '22 at 22:40
  • If generating the whole cartesian product is ok, of course, you can just generate the whole cartesian product then sort it. – Stef May 16 '22 at 22:41
  • Could you clarify what you mean by "largest pair" -- how are you defining the size / weight of a pair? – BrokenBenchmark May 16 '22 at 22:42
  • @KarlKnechtel I've clarified the question to address your comments. It's specifically because the processing is slow that I need to generate the pairs in this order...the end function will terminate early if the correct properties are met. – Stefan May 16 '22 at 23:11
  • Ah, so the point is to lazily generate them, in a way that doesn't have to go all the way through one input before advancing to the next element in the other? (i.e., a way that would also work if the inputs are lazily generated) I know of ways to do this, if you can accept O(N) storage of elements already "seen". Otherwise I don't think it's possible. – Karl Knechtel May 16 '22 at 23:13
  • @KarlKnechtel Exactly, the iterables may have millions of entries, but in most cases the desired outcome will be found in combinations of the first few entries. So it's much more efficient to try those first. – Stefan May 16 '22 at 23:15
  • @KarlKnechtel If the iterables can be iterated multiple times, we can do it with O(1) memory. – Kelly Bundy May 16 '22 at 23:17
  • If they're `range`s or something, yes. Otherwise they are already taking up O(N) memory themselves ;) – Karl Knechtel May 16 '22 at 23:28
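
Stef's priority-queue idea from the comments can be sketched as follows. This is a minimal illustration, not code from the thread: it assumes both inputs are sequences (random access by index), and `heap_pairs` plus the particular tie-breaking key are my own choices; with the key `(max(i, j), j, i)` it happens to reproduce the order asked for above.

```python
import heapq

def heap_pairs(a, b):
    # Frontier expansion over the index grid using a min-heap.
    # Assumes a and b are indexable sequences, not arbitrary iterables.
    if not a or not b:
        return
    seen = {(0, 0)}                # index pairs already pushed
    heap = [((0, 0, 0), 0, 0)]     # (key, i, j); key = (max(i, j), j, i)
    while heap:
        _, i, j = heapq.heappop(heap)
        yield (a[i], b[j])
        # The only candidates for "next largest" are the two neighbours.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, ((max(ni, nj), nj, ni), ni, nj))
```

With the sample inputs this yields `('C', 3), ('B', 3), ('C', 2), ...` as in the question. The heap only ever holds the current frontier, though the `seen` set does grow with the number of pairs visited.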

2 Answers


Using the roundrobin recipe that Karl mentioned (copied verbatim from the itertools recipes; it could also be imported from more-itertools). I think this will be faster, since all the hard work is done in the C code of the various itertools functions.

from itertools import repeat, chain, cycle, islice, count

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

def pairs(a, b):
    aseen = []
    bseen = []
    def agen():
        for aa in a:
            aseen.append(aa)
            yield zip(repeat(aa), bseen)
    def bgen():
        for bb in b:
            bseen.append(bb)
            yield zip(aseen, repeat(bb))
    return chain.from_iterable(roundrobin(agen(), bgen()))

a = ['C', 'B', 'A']
b = [3, 2, 1]
print(*pairs(a, b))

Output (Try it online!):

('C', 3) ('B', 3) ('C', 2) ('B', 2) ('A', 3) ('A', 2) ('C', 1) ('B', 1) ('A', 1)

Benchmark with two iterables of 2000 elements each (Try it online!):

 50 ms   50 ms   50 ms  Kelly
241 ms  241 ms  242 ms  Karl

Alternatively, if the two iterables can be iterated multiple times, we don't need to save what we've seen (Try it online!):

def pairs(a, b):
    def agen():
        for i, x in enumerate(a):
            yield zip(repeat(x), islice(b, i))
    def bgen():
        for i, x in enumerate(b, 1):
            yield zip(islice(a, i), repeat(x))
    return chain.from_iterable(roundrobin(agen(), bgen()))

(Will add to the benchmark later... Should be a little slower than my first solution.)

An extreme map/itertools version of that (Try it online!):

def pairs(a, b):
    return chain.from_iterable(roundrobin(
        map(zip,
            map(repeat, a),
            map(islice, repeat(b), count())),
        map(zip,
            map(islice, repeat(a), count(1)),
            map(repeat, b))
    ))
Kelly Bundy
  • Ah, round-robin'ing *the generators* to avoid the tracking problem! Neat. – Karl Knechtel May 17 '22 at 00:49
  • @KarlKnechtel [Short solution](https://tio.run/##XVJNb6MwEL3zK0btAbuCSGkvq0itlPRLe@mhPUYIGZg00xLbMs429M9nPWY3gY6EEfN4bzxvxvZ@a/TNL@uOx40zOyCPzhvTdkA7a5wHhxaVz6Du6xYzoK6lGpOkwQ04s9eNMxVpccU8VbXYyUUCIS5GYLpc3acZpA98PD6lEvL8DpbwAI@wgie4v4iUS3jFmixC7bAJeg14A89o3DvCm/r8pC7@pve7UtWe/iDcQotanEsPOB58F6B44QiGQ87KkoGyhI1xoUsgDT@IX1tqcSQ/NMLhXX/@4GANlmOVWG8Kc/SEbRNBIU8gHmq0Ht68sb@5uCejp1Q2YWdCb36LwwV9qPWF8LHvfOBvVXgHa@Kw@J/Y5myiMXIov4X5FJu6E6cpYjIb8aQcJmwVuU6oDKp/Y@0QNdNfjA7LsC74Sf47cgjrwY6MRv9NlunDEom5lBlwqjql8pCTZwtYpmeNWGidUzG1Z3BVhEq9XC8WVJzQgUDFTFmLuhGH0IIKN12ncfdWfCzTIqk4d5PBdQbzIrGOtBdXoz7l8fgX). – Kelly Bundy May 17 '22 at 01:14
  • (I think that one is closer to what you had in mind.) – Kelly Bundy May 17 '22 at 01:14
  • This is very elegant! – Stefan May 17 '22 at 08:47

If you aren't too picky about the order, per the comments, and the goal is just lazy generation that considers the inputs "evenly", here is a straightforward approach for two inputs: track the elements that have been "seen" so far from each input, while iterating over the two inputs in round-robin fashion (also see the recipes in the itertools documentation). With each new element, yield the pairs that it makes with elements from the other input.

To make that work, though, we need to be aware of which source we are pulling from in the round-robin iteration. We could modify the implementation of roundrobin so that it yields that information in some encoded form; but I think it will be easier for our purposes to just write the whole algorithm directly, taking inspiration from there.

Thus:

def lazy_pairs(a, b):
    a_seen, b_seen = [], []
    a_active, b_active = True, True
    a_iter, b_iter = iter(a), iter(b)
    while a_active or b_active:
        if a_active:
            try:
                a_next = next(a_iter)
            except StopIteration:
                a_active = False
            else:
                for b_old in b_seen:
                    yield (a_next, b_old)
                a_seen.append(a_next)
        if b_active:
            try:
                b_next = next(b_iter)
            except StopIteration:
                b_active = False
            else:
                for a_old in a_seen:
                    yield (a_old, b_next)
                b_seen.append(b_next)

Generalizing for more inputs is left as an exercise. (Note that you could fall back on itertools.product to find all the possibilities involving the element just taken from one input, and seen elements from all the others.)
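
As a rough sketch of that generalization (my code, not part of the original answer; `lazy_tuples` is a made-up name, and finite inputs are assumed): keep one "seen" list per input, round-robin over the inputs, and combine each new element with the `itertools.product` of what the other inputs have produced so far.

```python
from itertools import product

def lazy_tuples(*iterables):
    # One "seen" list, one iterator, and one active-flag per input.
    seen = [[] for _ in iterables]
    iters = [iter(it) for it in iterables]
    active = [True] * len(iterables)
    while any(active):
        for k, it in enumerate(iters):
            if not active[k]:
                continue
            try:
                x = next(it)
            except StopIteration:
                active[k] = False
                continue
            # Pair the new element with every combination of elements
            # already seen from the other inputs.
            yield from product(*seen[:k], (x,), *seen[k + 1:])
            seen[k].append(x)
```

For two inputs this produces the same order as lazy_pairs above; for more inputs it still emits every tuple of the full product exactly once.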

Karl Knechtel
  • As deftly alluded to in the question comments: if we know that the inputs are sequences, then we can just track indices into them and retrieve the 'seen' elements from the originals, rather than building up our own lists. But in this situation, it seems less likely that the iteration order would matter in the same way. – Karl Knechtel May 16 '22 at 23:30
  • As it happens, one of the projects on my design backburner is a class to represent an iterable with "backing storage", so that you can consider previous elements without having to load all the next ones. I had intended to include this very algorithm in the package. – Karl Knechtel May 16 '22 at 23:32
  • With `yield from zip(repeat(a_next), b_seen)` (and likewise for the other), yours would only take ~3 times as long as mine. – Kelly Bundy May 17 '22 at 00:43
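
For reference, Karl's generator with Kelly's suggested change applied (my transcription, not benchmarked here; `lazy_pairs_fast` is a made-up name, and the ~3x figure is Kelly's, not re-measured):

```python
from itertools import repeat

def lazy_pairs_fast(a, b):
    # Karl's round-robin algorithm, with the inner Python-level loops
    # replaced by zip(repeat(...), ...) as suggested in the comment above.
    a_seen, b_seen = [], []
    a_active = b_active = True
    a_iter, b_iter = iter(a), iter(b)
    while a_active or b_active:
        if a_active:
            try:
                a_next = next(a_iter)
            except StopIteration:
                a_active = False
            else:
                yield from zip(repeat(a_next), b_seen)
                a_seen.append(a_next)
        if b_active:
            try:
                b_next = next(b_iter)
            except StopIteration:
                b_active = False
            else:
                yield from zip(a_seen, repeat(b_next))
                b_seen.append(b_next)
```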