Excluding list of values from range with large numbers

Question

I want a sample through

random.sample(range(11111111111,99999999999), len(df.index))

However, I have list of values I do not want to be included in the sample, and therefore want to remove it from the range before.

Problem: Converting the range to a list for filtering results in a Memory Error - I believe it takes to much memory

Any suggestions?

You might also get interested in taking a look at https://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator-iterable-iterator — RandomGuy, Apr 20 '23 at 10:23
Is that your actual range? How large is `len(df.index))`? How large is the exclude-list, and how many unique values does it have in the range? — Kelly Bundy, Apr 20 '23 at 21:19

Jabro · Answer 1 · 2023-05-11T18:42:31.943

First, define a generator to create an infinite sequence of random numbers using yield so the numbers are calculated only once they are needed:

from random import randint

exclusion_set = {11111111111, 22222222222, 33333333333}

def random_number_generator(min_value : int, max_value : int):
    while True:
        i = randint(min_value, max_value)
        if i not in exclusion_set:
            yield i

Then, use itertools.islice() to generate as many random numbers as needed.

Full code:

from random import randint
from itertools import islice
from time import perf_counter

min_value = 11111111111
max_value = 99999999999
num_samples = 1000000
exclusion_set = {11111111111, 22222222222, 33333333333}

def random_number_generator(min_value : int, max_value : int):
    while True:
        i = randint(min_value, max_value)
        if i not in exclusion_set:
            yield i

start_time = perf_counter()
result = islice(random_number_generator(min_value, max_value), num_samples)
result = list(result)
end_time = perf_counter()

#print(result)
print(len(result))
print(end_time - start_time)

Notes:

The way the random numbers are calculated is probably improvable.
For more info, as suggested by RandomGuy, check this question.

If you need the result list to not contain duplicated values, add a "seen" set in the generator to keep track of already generated numbers:

def random_number_generator(min_value : int, max_value : int):
    seen = set()
    while True:
        i = randint(min_value, max_value)
        if i not in exclusion_set and i not in seen:
            seen.add(i)
            yield i

Alain T. · Answer 2 · 2023-04-20T22:51:46.150

You can obtain a O(NlogX) solution (where N is sample size and X is number of exclusions) by building a list of offsets and reducing your sample range by the number of excluded values. Then use a binary search to adjust the sample values back to the full range by applying the offsets to increase them (i.e. performing the skips):

import random
from bisect import bisect_right

minValue    = 111
maxValue    = 999
exclusions  = [121, 131, 141, 151] # must be sorted
offsets     = [n-i for i,n in enumerate(exclusions)]
sampleSize  = 15

def getSample():
    S = random.sample(range(minValue,maxValue+1-len(exclusions)), sampleSize)
    return [n + bisect_right(offsets,n) for n in S]

output:

print(getSample())
[574, 891, 953, 210, 331, 253, 117, 845, 572, 885, 699, 799, 563, 857, 532]

checking for validity of the output:

excludedSet = set(exclusions)
for _ in range(10000):
    s = getSample()
    if excludedSet.intersection(s):
        print("failed exclusions",s)
        break
else:
    print("all samples are compliant")

--> all samples are compliant

Note that this assumes that the number of excluded values is reasonable. If you are excluding whole subsets or patterns of values, a different solution would probably be better. You would have to be more specific in your question though.

Yazeed Alnumay · Answer 3 · 2023-04-20T10:15:25.253

-1

If performance is not critical, you can keep sampling integers in the desired range and skipping them if they are undesired, as follows:

import random

skip = {111_111_111, 321_321_321}
target_samples = 10
samples = []
while len(samples) < target_samples:
    n = random.randint(111_111_111, 999_999_999)
    if n not in skip:
        samples.append(n)
print(samples)

edited Apr 20 '23 at 10:15

answered Apr 20 '23 at 10:13

Yazeed Alnumay

185
8

2

Use a set for `skip`, as linear search in a list is extremely inefficient – RandomGuy Apr 20 '23 at 10:14
That is actually the approach I come up with for now, however I feel like it is not really elegant and hoped for a better solution – oluo.acn Apr 20 '23 at 10:15
@oluo.acn Yeah, I'm not sure if there is an elegant solution. There are some ways to exclude values from `range` discussed [here](https://stackoverflow.com/questions/24089924/skip-over-a-value-in-the-range-function-in-python), with perhaps `itertools.chain` looking the most promising, depending on the distribution of your skip values. – Yazeed Alnumay Apr 20 '23 at 10:20

Excluding list of values from range with large numbers

3 Answers3