Keep Python from Providing Duplicate Outputs

Question

My goal is to have a range of items provide as many outputs as can be had, but without duplicate outputs. The code I've provided is a small sample of what I'm working on. For my larger data set, I notice duplicate outputs plague the CSV file when the script runs so I was wondering if there is a way to keep duplicate outputs from processing while maintaining the high range (100, 250, 400, etc)?

import random
Saying = ["I Like"]

Food = ['Coffee', 'Pineapples', 'Avocado', 'Bacon']
Holiday = ['on the 4th of July', 'on April Fools', 'during Autumn', 'on Christmas']

for x in range(10):
    One = random.choice(Saying)
    Two = random.choice(Food)
    Three = random.choice(Holiday)
    print(f'{One} {Two} {Three}')

Thanks for the help!

Possible duplicate of [All combinations of a list of lists](https://stackoverflow.com/questions/798854/all-combinations-of-a-list-of-lists) — DSC, Jul 02 '19 at 07:56
Possible duplicate of [how to sample from cartesian product without repetition in python?](https://stackoverflow.com/questions/48686767/how-to-sample-from-cartesian-product-without-repetition-in-python) — Georgy, Jul 02 '19 at 08:35

Afik Friedberg · Answer 1 · 2019-07-02T08:00:18.787

0

You can use set with the element that you already seen and then check if you see the element in the set with complexity of O(1) in average.

Another option is to shuffle your list and pop some element:

import random

random.shuffle(lst)

while lst:
    element = x.pop()

edited Jul 02 '19 at 08:00

answered Jul 02 '19 at 07:55

Afik Friedberg

332
2
8

1

Still remains inefficient since you'll create many duplicates – DSC Jul 02 '19 at 07:56

score 0 · Answer 2 · answered Jul 02 '19 at 08:13

You can use np.random.choice with parameter replace=False. Moreover you can samples as much sample as you want using size argument.

import numpy as np

Food = ['Coffee', 'Pineapples', 'Avocado', 'Bacon']
Holiday = ['on the 4th of July', 'on April Fools', 'during Autumn', 'on Christmas']

np.random.choice(Food, size=4, replace=False)
>>> array(['Avocado', 'Coffee', 'Pineapples', 'Bacon'], dtype='<U10')

np.random.choice(Holiday, size=4, replace=False)
>>> array(['on April Fools', 'on the 4th of July', 'during Autumn',
       'on Christmas'], dtype='<U18')

score 0 · Accepted Answer · answered Jul 02 '19 at 08:17

The issue is that your bot (i guess?) has no memory of what the outputs have been so far, so there's really no way to check with the code you have.

try with this instead:

import random
Saying = ["I Like"]

Food = ['Coffee', 'Pineapples', 'Avocado', 'Bacon']
Holiday = ['on the 4th of July', 'on April Fools', 'during Autumn', 'on Christmas']

memory=[]
done = False

while not done:
    One = random.choice(Saying)
    Two = random.choice(Food)
    Three = random.choice(Holiday)

    if f'{One} {Two} {Three}' not in memory:
        memory.append(f'{One} {Two} {Three}')
        if len(memory) == 10:
            done = True

[print(item) for item in memory]

so now instead of taking 10 potshots at creating 10 phrases, we're taking as many as it takes to create 10 different ones.

Thanks everyone! I really appreciate the help from this community for looking into my question. It works, yay! — Doug Morrow, Jul 02 '19 at 16:50

score 0 · Answer 4 · answered Jul 02 '19 at 08:33

You can generate random output while maintaining non-redundant data by:

First creating a list permutations which is basically product of lists to be permutated.

permutations = list(itertools.product(*Statement))
## Example - [('I Like', 'Coffee', 'on the 4th of July'), ('I Like', 'Coffee', 'on April Fools'), ('I Like', 'Coffee', 'during Autumn'), ('I Like', 'Coffee', 'on Christmas')]

Pick out elements from the permutations by randomly selecting index and printing it.

 num = int(random.random() * total_elements)
 print '{} {} {}'.format(permutations[num][0], permutations[num][1], permutations[num][2])

Next, we remove the element from list permutations so as to avoid redunancy.

del permutations[num]

Complete code:

import itertools, random
Saying = ["I Like"]

Food = ['Coffee', 'Pineapples', 'Avocado', 'Bacon']
Holiday = ['on the 4th of July', 'on April Fools', 'during Autumn', 'on Christmas']

Statements = [Saying, Food, Holiday]

permutations = list(itertools.product(*Statements))

random.seed()

total_elements = len(Saying) * len(Food) * len(Holiday)

while total_elements > 0:
    num = int(random.random() * total_elements)
    print '{} {} {}'.format(permutations[num][0], permutations[num][1], permutations[num][2])
    del permutations[num]
    total_elements = total_elements - 1

score 0 · Answer 5 · answered Jun 22 '23 at 23:24

We create a sequence using a prime number and one of its primitive roots modulo n that visits each number in an interval exactly once. More specifically we are looking for a generator of the multiplicative group of integers modulo n.

We have to pick our prime number a little larger than the product prod([len(i) for i in iterables)], so we have to account for the cases in which we'd get index errors.

import random
from math import gcd
import math

from math import prod
from typing import Iterable


def next_prime(number):
    if number < 0:
        raise ValueError('Negative numbers can not be primes')
    if number <= 1:
        return 2
    if number % 2 == 0:
        number -= 1
    while True:
        number += 2
        max_check = int(math.sqrt(number)) + 2
        for divider in range(3, max_check, 2):
            if number % divider == 0:
                break
        else:
            return number


def is_primitive_root(a, n):
    phi = n - 1
    factors = set()
    for i in range(2, int(phi ** 0.5) + 1):
        if phi % i == 0:
            factors.add(i)
            factors.add(phi // i)
    for factor in factors:
        if pow(a, factor, n) == 1:
            return False
    return True


def find_random_primitive_root(n):
    while True:
        a = random.randint(2, n - 1)
        if gcd(a, n) == 1 and is_primitive_root(a, n):
            return a


class CoordinateMapper:
    """
        A class that maps linear indices to multi-dimensional coordinates within a specified shape.

        Args:
            dims (Iterable[int]): An iterable representing the dimensions of the desired shape.

        Example Usage:
            shape = (2, 3, 5, 4)
            mapper = CoordinateMapper(shape)
            coordinates = mapper.map(10)
            print(coordinates)  # Output: [0, 1, 2, 2]
    """

    def __init__(self, dims: Iterable[int]):
        self.moduli = [prod(dims[i:]) for i in range(len(dims))]
        self.divisors = [prod(dims[i + 1:]) for i in range(len(dims))]

    def map(self, n: int):
        return [(n % self.moduli[i]) // self.divisors[i] for i in range(len(self.moduli))]


def sampler(l):
    close_prime = next_prime(l)
    state = root = find_random_primitive_root(close_prime)
    while state > l:
        state = (state * root) % close_prime  # Inlining the computation leads to a 20% speed up

    yield state - 1
    for i in range(l - 1):
        state = (state * root) % close_prime
        while state > l:
            state = (state * root) % close_prime
        yield state - 1


def _unique_combinations(*iterables):
    cartesian_product_cardinality = prod([len(i) for i in iterables])
    coordinate_mapper = CoordinateMapper([len(i) for i in iterables])
    sequence = sampler(cartesian_product_cardinality)
    for state in sequence:
        yield tuple(iterable[coord] for iterable, coord in zip(iterables, coordinate_mapper.map(state)))

I started benchmarking the various approaches. I couldn't get any solution other than @achint's to run without assertion error:

from itertools import product
import time
approaches= {
    'prime_roots':_unique_combinations,
    'matmarbon':random_order_cartesian_product,
    'itertools.product':itertools.product,
}
a = list(range(10))
b = list(range(10))
for name, approach in approaches.items():
    assert sorted(u)==sorted(product(a,b))

For the 2 algorithms I benchmarked the following, with itertools as baseline

import pandas as pd
import timeit
import matplotlib.pyplot as plt

def benchmark(approaches, list_lengths, num_repetitions):
    results = []

    for approach, function in approaches.items():
        for length in list_lengths:
            a = list(range(length))
            b = list(range(length))
            def f():
                for item in function(a,b):
                    pass
            execution_time = timeit.timeit(f, number=num_repetitions)
            entry = {
                'Approach': approach,
                'List Length': length,
                'Execution Time': execution_time
            }
            print(entry)
            results.append(entry)

    results_df = pd.DataFrame(results)

    # Plot the benchmark results
    plt.figure(figsize=(10, 6))
    for approach in approaches.keys():
        approach_results = results_df[results_df['Approach'] == approach]
        plt.plot(approach_results['List Length'], approach_results['Execution Time'], marker='o', label=approach)

    plt.xlabel('List Length')
    plt.ylabel('Execution Time (seconds)')
    plt.title('Benchmark Results')
    plt.grid(True)
    plt.legend()
    plt.show()

list_lengths = [10,20,30,40,50,60,70,80,90,100,200,300,400,500]
num_repetitions = 3
benchmark(approaches, list_lengths, num_repetitions)

Both algorithms perform in O(n^k) for k~len(iterables) (assuming somewhat evenly sized iterables.

Memory-wise the prime roots approach wins just because only O(1) memory is required and nothing is stored.

Distribution wise the prime roots approach is not actually random but rather a difficult-to-predict-deterministic sequence. In practice the sequences should be "random" enough.

Credit to this stack overflow answer which inspired the solution.

Keep Python from Providing Duplicate Outputs

5 Answers5