
I have two lists:

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]

I would like to generate all possible combinations with a Python generator. I know I could do:

import itertools
import random

combinations = list(itertools.product(a, b))
random.shuffle(combinations)

But that has an extreme memory cost, as I would have to hold all possible combinations in memory even if I only wanted two random unique combinations.

My target is a Python generator whose memory cost grows with the number of iterations requested from it, only reaching the same order of memory cost as the itertools approach once all combinations have been consumed.

I had this for now:

import random
from typing import List

def _unique_combinations(a: List, b: List):
    """
    Creates a generator that yields unique combinations of elements from a and b
    in the form of (a_element, b_element) tuples in a random order.
    """
    len_a, len_b = len(a), len(b)
    generated = set()
    for i in range(len_a):
        for j in range(len_b):
            while True:
                # choose random elements from a and b
                element_a = random.choice(a)
                element_b = random.choice(b)
                if (element_a, element_b) not in generated:
                    generated.add((element_a, element_b))
                    yield (element_a, element_b)
                    break

But it's flawed, as it can theoretically run forever if the random.choice calls are unlucky.

I'm looking to modify that existing generator so that it generates the indexes randomly within a bounded amount of time. It is fine to keep track of the indexes already produced, as that memory cost grows linearly with the number of items consumed rather than with the full product.

How could I modify that random index generator to be bounded in time?

eljiwo
  • How many random pairs do you expect to generate? Because if you need to generate a higher number, then tracking which pairs you already found will mean that you will almost have the full set of possibilities stored in memory anyway. – poke Jun 18 '23 at 16:02
  • Does this answer your question? [How to sample from Cartesian product without repetition](https://stackoverflow.com/questions/48686767/how-to-sample-from-cartesian-product-without-repetition) – Sebastian Wozny Jun 18 '23 at 16:13
  • Can you define what "randomly" means in your context specifically? – Sebastian Wozny Jun 18 '23 at 16:14
  • Depending on your definition of randomness, and if you can accept a pre-determined random sequence, then you could work with a unique LCG to generate the pairs (see the sketch after these comments). – poke Jun 18 '23 at 16:21
  • Are the length of `a` and `b` always the same? – Bill Jun 18 '23 at 17:15
  • I can accept a predetermined random sequence, and a and b don't necessarily have to be similar lengths – eljiwo Jun 18 '23 at 21:16
  • Have a look at this: [Iterating over a list in pseudo-random order without storing a shuffled list](https://stackoverflow.com/questions/20190315/iterating-over-a-list-in-pseudo-random-order-without-storing-a-shuffled-list). This might yield a sub-optimal but acceptable solution if you can figure out how to code it. – Bill Jun 18 '23 at 23:30
  • @Bill Or [this](https://stackoverflow.com/q/30615536/12671057). – Kelly Bundy Jun 18 '23 at 23:50
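
The full-period LCG idea poke mentions above can be sketched like this (my illustration, not from the thread; lcg_indices and lcg_pairs are hypothetical names). By the Hull–Dobell theorem, x -> (a*x + c) mod m has full period when m is a power of two, c is odd, and a % 8 == 5, so the recurrence visits every value in [0, m) exactly once per cycle; values beyond the needed range are simply skipped:

import random

def lcg_indices(n):
    # Full-period LCG over the next power of two >= n (Hull-Dobell:
    # c odd and a % 8 == 5 for a power-of-two modulus); skip values >= n.
    m = 1 << max(1, (n - 1).bit_length())
    a = 8 * random.randrange(max(m // 8, 1)) + 5
    c = 2 * random.randrange(m // 2) + 1
    x = random.randrange(m)
    for _ in range(m):
        if x < n:
            yield x
        x = (a * x + c) % m

def lcg_pairs(a_list, b_list):
    # O(1) extra memory: decode each 1-D index into an (a, b) pair.
    len_b = len(b_list)
    for idx in lcg_indices(len(a_list) * len_b):
        yield a_list[idx // len_b], b_list[idx % len_b]

Every pair is produced exactly once, in a scrambled (though not uniformly random) order, without tracking a set of seen pairs.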

7 Answers


Random picking is cheap at the beginning and expensive at the end, and exhaustive listing is expensive at the beginning and cheap at the end. Here's a "best of both worlds" approach where we switch strategies halfway through:

import itertools
import random
from typing import Iterator, TypeVar

_A = TypeVar("_A")
_B = TypeVar("_B")

def unique_product(a: list[_A], b: list[_B]) -> Iterator[tuple[_A, _B]]:
    total = len(a) * len(b)
    results: set[tuple[_A, _B]] = set()

    # For the first half of the results, use random guessing.
    # Until we've exhausted half of the possibility
    # space, we should average < 2 guesses per element, so
    # we can consider this to be amortized O(1) per element.
    result = random.choice(a), random.choice(b)
    while len(results) < total // 2:
        while result in results:
            result = random.choice(a), random.choice(b)
        results.add(result)
        yield result

    # For the second half, build the exhaustive set of
    # remaining results.  We pay an O(n) cost to do this but
    # amortized over the entire generator life it's O(1) per
    # element.  Our space usage goes down after this, not up.
    remaining: list[tuple[_A, _B]] = []
    for result in itertools.product(a, b):
        if result in results:
            results.remove(result)
        else:
            remaining.append(result)
    random.shuffle(remaining)
    while remaining:
        yield remaining.pop()

The main potential problem with this approach is that you're paying a big O(n) cost right in the middle, so even though it washes out when you're looking at the entire run, for some use cases it might be undesirable to have a single arbitrary caller in the middle paying that entire cost all at once, as opposed to paying it up front, or evenly spreading it out among all callers. (I can think of ways you might try to avoid this by doing the swap in another thread, but that adds a lot of complexity. Maybe there's a better way?)

Note that as far as space goes this is pretty optimal, since you max out the space halfway through (with half of the elements in memory) and then the space usage decreases back toward zero since it's now cheaper to keep track of the elements you haven't allocated than the ones you have.
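
A quick usage sketch (my addition, not part of the answer): because unique_product is lazy, pulling only a few pairs with itertools.islice keeps the tracking set correspondingly small:

import itertools

pairs = unique_product(list(range(1000)), list(range(1000)))
# Only the five consumed pairs end up in the internal `results` set.
print(list(itertools.islice(pairs, 5)))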

Samwise
  • You're paying a somewhat large O(n) time cost at some point before the middle already, the last time the set resizes. But that's normal and we usually don't even think about it. Then again, yours in the middle is likely significantly larger still. Oh well. Maybe you could avoid it by building both sets in parallel. For every yielded pair, put the next two pairs from `itertools.product` into the second set (unless they're already in the first, and also remove pairs you yield from the second set). – Kelly Bundy Jun 18 '23 at 22:08
  • Posted [a version of that](https://stackoverflow.com/a/76502800/12671057) now. – Kelly Bundy Jun 18 '23 at 22:48

We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically we are looking for a generator of the multiplicative group of integers modulo n.

We have to pick our prime number a little larger than the product len(a)*len(b), so we have to account for the cases in which we'd get index errors.
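
As a tiny illustration (mine, not part of the original answer): 3 is a primitive root modulo the prime 7, so repeated multiplication by 3 visits every residue in 1..6 exactly once before cycling back to 1:

# Powers of the primitive root 3 modulo the prime 7 enumerate 1..6 once.
p, g = 7, 3
state, seen = g, []
for _ in range(p - 1):
    seen.append(state)
    state = state * g % p
print(seen)  # [3, 2, 6, 4, 5, 1]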

import random
from math import gcd
import math

def next_prime(number):
    if number < 0:
        raise ValueError('Negative numbers can not be primes')
    if number <= 1:
        return 2
    if number % 2 == 0:
        number -= 1
    while True:
        number += 2
        max_check = int(math.sqrt(number)) + 2
        for divider in range(3, max_check, 2):
            if number % divider == 0:
                break
        else:
            return number


def is_primitive_root(a, n):
    phi = n - 1
    factors = set()
    for i in range(2, int(phi ** 0.5) + 1):
        if phi % i == 0:
            factors.add(i)
            factors.add(phi // i)
    for factor in factors:
        if pow(a, factor, n) == 1:
            return False
    return True


def find_random_primitive_root(n):
    while True:
        a = random.randint(2, n - 1)
        if gcd(a, n) == 1 and is_primitive_root(a, n):
            return a


def sampler(l):
    close_prime = next_prime(l)
    state = root = find_random_primitive_root(close_prime)
    while state > l:
        state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up

    yield state - 1
    for i in range(l - 1):
        state = (state * root) % close_prime
        while state > l:
            state = (state * root) % close_prime
        yield state - 1

Then we use a mapping from 1D -> 2D to "translate" our sequence number into a tuple and yield the result.

def _unique_combinations(a, b):
    cartesian_product_cardinality = len(a) * len(b)
    sequence = sampler(cartesian_product_cardinality)
    len_b = len(b)  # Hoisting len(b) out of the loop yields a 10% total speed up
    for state in sequence:
        yield a[state // len_b], b[state % len_b]


from itertools import product

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
u = _unique_combinations(a, b)

assert sorted(u) == sorted(product(a, b))

I started benchmarking the various approaches. For merging two lists of length 1000, the divmod solution by @gog already underperforms terribly, so I'm going to exclude it from further testing:

kelly took 0.9156949520111084 seconds
divmod took 41.20149779319763 seconds
prime_roots took 0.5146901607513428 seconds
samwise took 0.698538064956665 seconds
fisher_yates took 0.902874231338501 seconds

For the remaining algorithms, I ran the following benchmark:

import pandas as pd
import timeit

# Define the list lengths to benchmark
list_lengths = [10, 20, 30, 100, 300, 500, 1000, 1500, 2000, 3000, 5000]

num_repetitions = 2

# `approaches` maps a label to each competing generator function, e.g.
# {'prime_roots': _unique_combinations, 'samwise': unique_product, ...}
rows = []
for approach, function in approaches.items():
    for length in list_lengths:
        a = list(range(length))
        b = list(range(length))

        execution_time = timeit.timeit(lambda: list(function(a, b)), number=num_repetitions)

        rows.append({
            'Approach': approach,
            'List Length': length,
            'Execution Time': execution_time
        })

# Build the frame once at the end; DataFrame.append was removed in pandas 2.0.
results_df = pd.DataFrame(rows)

[Benchmark plot: execution time vs. list length for each approach]

All in all, I think all of the approaches are somewhat similar. All tested approaches fall in O(|a|*|b|) time complexity.

Memory-wise the prime roots approach wins just because all other approaches need to keep track of O(|a|*|b|) elements, whereas the prime roots doesn't require that.

Distribution-wise, the prime roots approach is absolutely the worst, because it's not actually random but rather a difficult-to-predict deterministic sequence. In practice the sequences should be "random" enough.
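
This can be checked empirically (my sketch, along the lines of Kelly Bundy's test in the comments below): counting the distinct output orders over many runs shows far fewer than the 9! = 362880 possible orders for a 3x3 input:

from collections import Counter

# Count distinct output orders over many runs of the generator.
orders = Counter(tuple(_unique_combinations('abc', 'ABC')) for _ in range(10000))
print(len(orders))  # only a handful of distinct orders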

Credit to this Stack Overflow answer, which inspired the solution.

Sebastian Wozny
  • Not sure how good it is in general, but for example `_unique_combinations('abc', 'ABC')` only produces four different orders out of the possible 362880. – Kelly Bundy Jun 19 '23 at 00:51
  • This is because if the product of the sequences isn't very large to begin with there are very few roots to choose from. If the sequences are very short, no need for this solution. – Sebastian Wozny Jun 19 '23 at 00:59
  • I think this is a very elegant solution! Proposing some changes to compute the closest prime dynamically but works great, thanks! – eljiwo Jun 19 '23 at 05:21
  • I've been testing it for a bit and it seems find_primitive_roots is quite a time sink; just posted an answer with a mix of all the solutions proposed in the posts to see if it's correct. – eljiwo Jun 19 '23 at 07:38
  • Maybe calculating a "random" root is good enough, rather than calculating all of them and choosing a random one. – Sebastian Wozny Jun 19 '23 at 08:35
  • I really like your idea of `next prime`. Primality tests can be done relatively quickly. I was stuck trying to generate all primes ... – Sebastian Wozny Jun 19 '23 at 08:42
  • I changed the code to find a random primitive root instead. Should be a lot less of a time sink. – Sebastian Wozny Jun 19 '23 at 08:58
  • Do you have code to replicate the execution time? When I did a run with a 1M sequence length, your code was getting stuck and not computing; not sure if it was one of the prime calculations. I'll head to bed now, so just leaving the question here; otherwise I can try to debug tomorrow. – eljiwo Jun 19 '23 at 22:25
  • Added it. What a marathon – Sebastian Wozny Jun 20 '23 at 07:55
  • For two sequences of length 1M, even `itertools.product` takes 1600 seconds on my MacBook – Sebastian Wozny Jun 20 '23 at 16:01
  • Appreciate the time spent on this, learned a lot from it! – eljiwo Jun 20 '23 at 21:16

Represent each combination by a number, `position_in_b * len(a) + position_in_a` (matching the `divmod` below). Keep generating these numbers randomly; once a number has been hit, just increment it mod `len(a) * len(b)`:

import random


def _unique_combinations(a, b):
    sa = len(a)
    sb = len(b)
    sx = sa * sb
    seen = set()
    while len(seen) < sx:
        n = random.randint(0, sx - 1)
        while n in seen:
            n = (n + 1) % sx
        seen.add(n)
        p, q = divmod(n, sa)
        yield a[q], b[p]


##

a = [1, 2, 3, 5]
b = ["a", "b", "c", "d"]
   
u = list(_unique_combinations(a, b))

print(u)

# confirm everything has been generated

from itertools import product
assert sorted(u) == sorted(product(a, b))


# [(3, 'c'), (1, 'c'), (5, 'c'), (2, 'b'), (5, 'b'), (1, 'b'), (2, 'c'), (3, 'd'), (2, 'a'), (3, 'b'), (1, 'd'), (2, 'd'), (1, 'a'), (5, 'd'), (3, 'a'), (5, 'a')]
gog
  • By the time you've finished `seen` has n elements in it where n is the total number of combinations. Isn't the goal here to avoid holding n items in memory? If you are going to store these, it seems like you could just shuffle the integers 0 - n, use them as indices, and be done with it without the extra loops when random items exist. – Mark Jun 18 '23 at 16:42
  • @Mark: you store as much as you consume. If you only need two items from a ten million list, you will end up storing two numbers, not ten millions. `list(_unique_combinations` above is just for testing. – gog Jun 18 '23 at 19:06
  • This is biased, some orders are more likely than others. – Kelly Bundy Jun 18 '23 at 21:30
  • Try for example `print(sorted(Counter(tuple(_unique_combinations('ab', 'cd')) for _ in range(100000)).values()))`, giving frequencies like `[1443, 1516, 1565, 1580, 3058, 3067, 3072, 3099, 3115, 3116, 3151, 3153, 3171, 3180, 3211, 3219, 4563, 4614, 4678, 4715, 9367, 9403, 9469, 9475]`. – Kelly Bundy Jun 18 '23 at 21:37

Does this do what you wanted?

You can generate any combination from the list of all possible combinations using an integer index:

def get_combination(x, values):
    digits = []
    for items in reversed(values):
        base = len(items)
        i = int(x % base)
        digits.append(items[i])
        x = x // base

    digits.reverse()

    return digits

values = [[1, 2, 3, 5], ["a", "b", "c", "d"]]
assert(get_combination(0, values) == [1, 'a'])
assert(get_combination(1, values) == [1, 'b'])
assert(get_combination(15, values) == [5, 'd'])

Therefore you don't need to generate the shuffled list of combinations. I don't think there is any way to iteratively sample the integers in a range without repetition without generating the list (as explained in the answers to this question), but at least now you only need to generate an array of integers, which takes less memory:

import numpy as np

rng = np.random.default_rng()

def shuffled_combinations(values):
    counts = [len(x) for x in values]
    n = np.array(counts).prod()
    index = np.arange(n, dtype='uint')
    rng.shuffle(index)
    for i in index:
        yield get_combination(i, values)

for c in shuffled_combinations(values):
    print(c)

Output:

[1, 'd']
[2, 'c']
[3, 'a']
[5, 'a']
[5, 'd']
[5, 'b']
[3, 'c']
[2, 'b']
[1, 'a']
[1, 'b']
[5, 'c']
[1, 'c']
[3, 'b']
[2, 'a']
[2, 'd']
[3, 'd']
Bill
  • Well, the index variable would be of size a*b, right? Wouldn't that incur the maximum memory cost right at the start? – eljiwo Jun 18 '23 at 21:48
  • Yes it will take up `len(a) * len(b) * np.dtype('uint').itemsize` bytes of memory (128 in the example). But this may be less than storing the list of item combinations. You might be able to reduce the memory by using `uint32` or `uint16` in some cases (see the sketch below). – Bill Jun 18 '23 at 23:19
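
A quick check of Bill's numbers above (my sketch, not part of the thread): the index array's footprint is just the element count times the dtype size:

import numpy as np

# Footprint of the index array for the 4 x 4 example at different dtypes.
n = 4 * 4
for dtype in ('uint64', 'uint32', 'uint16'):
    print(dtype, np.arange(n, dtype=dtype).nbytes, 'bytes')
# uint64 -> 128 bytes, uint32 -> 64 bytes, uint16 -> 32 bytes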

A variation of Samwise's, but avoiding the large cost in the middle of creating `remaining`: its creation is spread over the first half of the process, and the cost of randomizing it over the second half. The conversion from set to list is relatively fast.

I suspect it's slower than Samwise's overall (and it does use more memory). It just might be better if the delay in the middle is unacceptable.

import random
import itertools

def random_order_product(a, b):
    yielded = set()
    remaining = set()

    for i, pair in enumerate(itertools.product(a, b)):
        if pair not in yielded:
            remaining.add(pair)
        if i % 2:
            pair = random.choice(a), random.choice(b)
            while pair in yielded:
                pair = random.choice(a), random.choice(b)
            yield pair
            yielded.add(pair)
            remaining.discard(pair)

    remaining = list(remaining)
    while remaining:
        i = random.randrange(len(remaining))
        yield remaining[i]
        remaining[i] = remaining[-1]
        remaining.pop()


# Demo showing the frequencies of the 4!=24 possible orders
from collections import Counter
print(sorted(Counter(
    tuple(random_order_product('ab', 'cd'))
    for _ in range(100000)
).values()))

Sample output of the order frequencies (Attempt This Online!):

[4045, 4078, 4107, 4112, 4113, 4113,
 4127, 4131, 4131, 4135, 4136, 4142,
 4149, 4164, 4172, 4186, 4188, 4196,
 4212, 4235, 4245, 4260, 4279, 4344]
Kelly Bundy

@gog: The code snippet has a limitation in terms of scalability. Because it uses a set to track generated combinations, its memory usage and performance become problematic as the total number of possible combinations increases.

DebugCode

Kindly, before you engage in writing such a program, let me introduce you to 'Permutations and Combinations'. Say you have a list named fruits (fruits = ['apples', 'mangoes', 'grapes']). The number of ways you can arrange the list is called a permutation, expressed mathematically with the factorial (!). Our list contains three items, so the number of possible arrangements is 3!, which equals 6: you have only six possible shuffles.

Combinations, on the other hand, means choosing a certain number of items from a list where order doesn't matter. For example, the number of ways to choose just two items from our list is expressed mathematically as 3C2, where 3 is the total number of items and 2 is the number chosen. This gives 3.

In Python, I would advise you to use itertools, an amazing module which will make your work easier. I would also like you to visit the following link for more insight: https://www.digitalocean.com/community/tutorials/permutation-and-combinatios-in-python
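
A minimal sketch of those counts with itertools (my addition, not part of the answer):

from itertools import combinations, permutations

fruits = ['apples', 'mangoes', 'grapes']
print(len(list(permutations(fruits))))  # 3! = 6 orderings
print(list(combinations(fruits, 2)))    # 3C2 = 3 unordered pairs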

JGithaka
  • I actually like your intention of pointing the OP to study material on the subject for a deeper understanding of the problem. – Glstunna Jun 19 '23 at 18:21
  • The OP was not looking for how to simply generate all the permutations. They wanted to know how to sample from the full list randomly without storing the full list in memory, which is apparently not possible using the itertools library. Therefore this answer is not helpful. If @JGithaka knows how to do this they should post the code to do it. – Bill Jun 19 '23 at 22:26
  • does not answer the question – weasel Jun 21 '23 at 13:26