What is the fastest way to check for existence and get random element from data structure in python?

Question

I need a data structure (preferably a built in python type) that allows for existence check in O(1) and also random element choosing in O(1). The elements to store in the structure are unique integers. I have considered this options:

list: O(1) random choice with random.choice, but O(n) existence with in operator.
set: O(1) existence check with in operator, but can't use random.choice without casting to tuple (O(n))
tuple: O(1) random choice with random.choice, but O(n) check for existence
dictionary: O(1) existence check , but memory use is at least 2 times as much compared to the other options (keys and values) , and random.choice does not work without casting to list or getting items (assumably O(n) operations).

Do I need to implement a compound data structure? What is the best approach? PD: I am using Python 3 by the way.

EDIT:

The data structure is created from a list comprehension with a range generator, so the data inside the sequence is always ordered. Something like this:

data=[x for x in range(n) if condition]

What other operations does the data structure need? E.g. do you need to be able to add new elements, remove them, and so on? Or just build a data structure with a fixed set of values? Is there a known range of values that the integers are within? — kaya3, Feb 20 '20 at 13:50
@kaya3 Hi, only random choice and check for existence is needed, is not necessary to add new values. Removing elements is optional, because the data structure gets discarded on every use. — Juan David, Feb 20 '20 at 13:52
Discarded on every use? If you're doing the operation(s) just once, then I don't see why it should matter if those operations are O(n) if you can do them without building the data structure in O(n) time. Or am I misunderstanding? — kaya3, Feb 20 '20 at 13:55
@kaya3 The creation and operation of the data structure is repeated over many cycles, so an O(n) operation in between add up. — Juan David, Feb 20 '20 at 13:58
I see. But if you could do it in O(n) time without building a data structure, wouldn't that be just as good as building a data structure in O(n) time and then doing one O(1) operation? — kaya3, Feb 20 '20 at 13:59
It's not possible, data between iterations is different, the data structure has to be instantiated every time, is merely a container for the generated data. @kaya3 — Juan David, Feb 20 '20 at 14:10
Have a look at https://boltons.readthedocs.io/en/latest/setutils.html - it seems someone already implemented what you want. — Błotosmętek, Feb 20 '20 at 14:13
I see your edit... if you know the condition, why not use the condition itself to test membership? — kaya3, Feb 20 '20 at 14:30

kaya3 · Accepted Answer · 2020-02-20T14:51:48.363

There are a few approaches for this. It depends how important it is to minimise memory usage. From the comments you've said it's not necessary to be able to add or remove elements.

Implicit set

For the special case where you know both the range that the numbers are drawn from, and the condition for them to be included in the set, and you also only need to do small number of membership and sampling operations, then you don't need to build any data structure at all. You can use the condition to test membership, and the range to sample from it:

import random

# example usage:
# >>> s = ImplicitSet(range(10), lambda x: x > 4)

class ImplicitSet:
    def __init__(self, range, condition):
        self.range = range
        self.condition = condition
    def contains(self, x):
        return x in self.range and self.condition(x)
    def sample(self):
        # you'll need to change this if the condition might never be true
        while True:
            x = random.choice(range)
            if self.condition(x):
                return x

Unfortunately you can't just generate one random number and iterate until you find a member of the set, since that would bias the sampling towards numbers with bigger "gaps" before them.

This takes O(1) memory, because you aren't storing any actual data structure. The contains operation takes whatever time it takes to test the condition, but if you're only testing membership for a small number of elements, it saves time compared to a list comprehension which tests the condition for every element of the range.

The sample operation requires a number of tries inversely proportional to the density of elements satisfying the condition within the range, so e.g. if the condition is true for 25% of the elements then you need four tries on average. So sampling takes constant time, but the size of the constant depends on the distribution of the actual numbers. Again, if you're only sampling a few times then this is much better than a list comprehension having to test the condition for every number in the range.

List and set

This way is the simplest to implement, and gives both membership tests (using the set) and random sampling (using the list) in constant time. However, it stores each integer in two containers instead of one, using more memory.

import random

class ListAndSet:
    def __init__(self, numbers):
        self.numbers_list = list(numbers)
        self.numbers_set = set(self.numbers_list) # numbers may be a generator
    def contains(self, x):
        return x in self.numbers_set
    def sample(self):
        return random.choice(self.numbers_list)

Sorted list

This one is also pretty simple to implement, and it uses the least possible memory unless your numbers are within a fixed size, in which case a sorted array is better. It takes O(n log n) time to sort the numbers in the first place, but after that, membership tests take O(log n) time using a binary search, and sampling takes O(1) time.

import bisect
import random

class SortedList:
    def __init__(self, numbers):
        self.sorted_list = sorted(numbers)
    def contains(self, x):
        index = bisect.bisect_left(self.sorted_list, x)
        return index < len(self.sorted_list) and self.sorted_list[index] == x
    def sample(self):
        return random.choice(self.sorted_list)

Dense set

If the numbers are densely packed within some range, you could store them in a set, and then sample by generating random numbers in that range until you find one that's in the set. The expected number of tries is inversely proportional to the density of the numbers within the range, so e.g. if you have 1,000 numbers within a range of size 4,000, you will need four tries on average.

import random

class DenseSet:
    def __init__(self, numbers):
        self.numbers_set = set(numbers)
        # you'll need to change this if the set might sometimes be empty
        self.low = min(self.numbers_set)
        self.high = max(self.numbers_set)
    def contains(self, x):
        return x in self.numbers_set
    def sample(self):
        # you'll need to change this if the set might sometimes be empty
        while True:
            x = random.randint(self.low, self.high)
            if x in self.numbers_set:
                return x

Your own hash-set

If O(log n) time isn't good enough, and the numbers are sparse, but you absolutely can't afford the memory required for both a list and a set, then you could implement your own hash-set; you can find example code by searching online. To sample in O(1) expected time, repeatedly generate random indices in the underlying list until you find a slot that is non-empty. This guarantees a uniform distribution of the samples. The expected number of tries is inversely proportional to the hash-set's load factor, so e.g. with a load factor of 0.8 you would need on average 1.25 tries.

Unfortunately you can't just generate one index and then iterate through until you find a non-empty slot, since the sampling would be biased towards elements with more empty slots behind them.

Hi, I have edited the question, initial generator is sorted, so I guess I could skip the sort part of your SortedList answer and assume it. I think it's the best answer because it achieves O(n) instantiation, O(1) random choice and O(log n) search.Thanks. — Juan David, Feb 20 '20 at 14:36

Pablo Gutierrez Marques · Answer 2 · 2020-02-20T14:44:46.257

1

Unless you have strong memory constraints, I would go with a list and a set, that you keep synchronized. You check with the set for existence but pick at random from the list.

class RandomChoice:
    def __init__(self, data):
        self.list = list(data)
        self.set = set(self.list)

    def add(self, element):
        if element not in self.set:
            self.list.append(element)
            self.set.add(element)

    def contains(self, element):
        return self.set.contains(element)

    def get_random(self):
        return random.choice(self.list)

Of course it takes twice the storage than a single list or a single set, but it does the job.

edited Feb 20 '20 at 14:44

answered Feb 20 '20 at 14:12

Pablo Gutierrez Marques

432
2
8

I have edited the question, this is a good answer, but since the data is generated from a list comprehension, it is time expensive to add elements to two different data structures. – Juan David Feb 20 '20 at 14:31
@JuanDavid You could easily adapt it to use the list that the list comprehension generates, instead of building a new list. – kaya3 Feb 20 '20 at 14:32
@kaya3 Could you elaborate please? – Juan David Feb 20 '20 at 14:37
1

@JuanDavid By passing data to the constructor of the class, you do not have to call add on every single element. Instead, the list is created from your original list (or a generator) and the corresponding set. – Pablo Gutierrez Marques Feb 20 '20 at 14:46