
I'm trying to generate a list of num_cities (x, y) tuples with the constraint that no two tuples are the same. Is there a shorter, more Pythonic way to do this using a set comprehension or itertools? I currently have:

from random import randint

def make_random_cities(num_cities, max_x, max_y):
    cities = set()
    while len(cities) < num_cities:
        x, y = randint(0, max_x), randint(0, max_y)
        cities.add((x, y))
    return list(cities)
petabyte

2 Answers


If the maximum values aren't too large to store the complete set of possibilities in memory (and it won't take forever to generate them), random.sample and itertools.product can be used effectively here:

import itertools
import random

def make_random_cities(num_cities, max_x, max_y):
    coords = itertools.product(range(max_x + 1), range(max_y + 1))
    return random.sample(list(coords), num_cities)

If the product of the inputs gets too large though, you could easily exceed main memory; in that case, your approach of looping until you have enough unique results is probably the best option.

You could do samples of each range independently and then combine them together, but that would add uniqueness constraints to each axis, which I'm guessing you don't want.
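For illustration, here's a minimal sketch of that per-axis approach (the function name is hypothetical). Because random.sample draws without replacement, no two cities can ever share an x or a y coordinate, and it requires num_cities <= min(max_x, max_y) + 1:

import random

def make_cities_unique_axes(num_cities, max_x, max_y):
    # Each axis is sampled without replacement, so every x value and
    # every y value appears at most once across all cities.
    xs = random.sample(range(max_x + 1), num_cities)
    ys = random.sample(range(max_y + 1), num_cities)
    return list(zip(xs, ys))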

For this specific case (unique numbers following a predictable pattern), you could use a trick to make this memory friendly while still avoiding the issue of arbitrarily long loops. Instead of taking the product of two ranges, you'd generate a single range (or in Py2, xrange) that encodes both unique values from the product in a single value:

def make_random_cities(num_cities, max_x, max_y):
    max_xy = (max_x+1) * (max_y+1)
    # Sample unique integers, then decode each one back into an (x, y) pair
    xys = random.sample(range(max_xy), num_cities)
    return [divmod(xy, max_y+1) for xy in xys]

This means you have no large intermediate data to store (because Py3 range/Py2 xrange are "virtual" sequences, with storage requirements unrelated to the range of values they represent, and random.sample produces samples without needing to read all the values of the underlying sequence).
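A quick way to convince yourself of this (a toy demonstration, not part of the original answer): sampling from an enormous range returns immediately, because the range is never materialized:

import random

# Completes instantly; no list of 10**15 elements is ever built.
print(random.sample(range(10**15), 5))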

ShadowRanger

Your current code is probably good if the number of cities is much smaller than max_x * max_y. If they're closer together though, it may waste a lot of time generating duplicate coordinates.
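As a rough illustration of that cost (a hypothetical helper; the numbers follow the coupon-collector pattern), counting the draws needed to fill an entire grid shows the overhead:

import random

def count_draws(num_cities, max_x, max_y):
    # Same rejection loop as the question's code, but counting attempts.
    cities, draws = set(), 0
    while len(cities) < num_cities:
        cities.add((random.randint(0, max_x), random.randint(0, max_y)))
        draws += 1
    return draws

# Filling all 10,000 cells of a 100x100 grid typically takes on the order
# of n * ln(n), roughly 92,000 draws, not 10,000.
print(count_draws(10_000, 99, 99))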

An alternative approach would be to generate all possible coordinates and then sample from them:

# Use max_x+1 and max_y+1 so the bounds are inclusive, matching randint(0, max_x)
possible_coords = list(itertools.product(range(max_x + 1), range(max_y + 1)))
sample = random.sample(possible_coords, num_cities)

Generating the list of possible coordinates will always take O(max_x * max_y) time, but it won't get worse as the number of cities increases.

Blckknght