2

I want a sample through

random.sample(range(11111111111,99999999999), len(df.index))

However, I have list of values I do not want to be included in the sample, and therefore want to remove it from the range before.

Problem: Converting the range to a list for filtering results in a Memory Error - I believe it takes to much memory

Any suggestions?

Mathias R. Jessen
  • 157,619
  • 12
  • 148
  • 206
oluo.acn
  • 39
  • 2
  • 2
    You might also get interested in taking a look at https://stackoverflow.com/questions/12581437/python-random-sample-with-a-generator-iterable-iterator – RandomGuy Apr 20 '23 at 10:23
  • Is that your actual range? How large is `len(df.index))`? How large is the exclude-list, and how many unique values does it have in the range? – Kelly Bundy Apr 20 '23 at 21:19

3 Answers3

0

First, define a generator to create an infinite sequence of random numbers using yield so the numbers are calculated only once they are needed:

from random import randint

exclusion_set = {11111111111, 22222222222, 33333333333}

def random_number_generator(min_value : int, max_value : int):
    while True:
        i = randint(min_value, max_value)
        if i not in exclusion_set:
            yield i

Then, use itertools.islice() to generate as many random numbers as needed.

Full code:

from random import randint
from itertools import islice
from time import perf_counter

min_value = 11111111111
max_value = 99999999999
num_samples = 1000000
exclusion_set = {11111111111, 22222222222, 33333333333}

def random_number_generator(min_value : int, max_value : int):
    while True:
        i = randint(min_value, max_value)
        if i not in exclusion_set:
            yield i

start_time = perf_counter()
result = islice(random_number_generator(min_value, max_value), num_samples)
result = list(result)
end_time = perf_counter()

#print(result)
print(len(result))
print(end_time - start_time)

Notes:

  • The way the random numbers are calculated is probably improvable.

  • For more info, as suggested by RandomGuy, check this question.


If you need the result list to not contain duplicated values, add a "seen" set in the generator to keep track of already generated numbers:

def random_number_generator(min_value : int, max_value : int):
    seen = set()
    while True:
        i = randint(min_value, max_value)
        if i not in exclusion_set and i not in seen:
            seen.add(i)
            yield i
Jabro
  • 480
  • 1
  • 2
  • 10
0

You can obtain a O(NlogX) solution (where N is sample size and X is number of exclusions) by building a list of offsets and reducing your sample range by the number of excluded values. Then use a binary search to adjust the sample values back to the full range by applying the offsets to increase them (i.e. performing the skips):

import random
from bisect import bisect_right

minValue    = 111
maxValue    = 999
exclusions  = [121, 131, 141, 151] # must be sorted
offsets     = [n-i for i,n in enumerate(exclusions)]
sampleSize  = 15

def getSample():
    S = random.sample(range(minValue,maxValue+1-len(exclusions)), sampleSize)
    return [n + bisect_right(offsets,n) for n in S]

output:

print(getSample())
[574, 891, 953, 210, 331, 253, 117, 845, 572, 885, 699, 799, 563, 857, 532]

checking for validity of the output:

excludedSet = set(exclusions)
for _ in range(10000):
    s = getSample()
    if excludedSet.intersection(s):
        print("failed exclusions",s)
        break
else:
    print("all samples are compliant")

--> all samples are compliant

Note that this assumes that the number of excluded values is reasonable. If you are excluding whole subsets or patterns of values, a different solution would probably be better. You would have to be more specific in your question though.

Alain T.
  • 40,517
  • 4
  • 31
  • 51
-1

If performance is not critical, you can keep sampling integers in the desired range and skipping them if they are undesired, as follows:

import random

skip = {111_111_111, 321_321_321}
target_samples = 10
samples = []
while len(samples) < target_samples:
    n = random.randint(111_111_111, 999_999_999)
    if n not in skip:
        samples.append(n)
print(samples)
  • 2
    Use a set for `skip`, as linear search in a list is extremely inefficient – RandomGuy Apr 20 '23 at 10:14
  • That is actually the approach I come up with for now, however I feel like it is not really elegant and hoped for a better solution – oluo.acn Apr 20 '23 at 10:15
  • @oluo.acn Yeah, I'm not sure if there is an elegant solution. There are some ways to exclude values from `range` discussed [here](https://stackoverflow.com/questions/24089924/skip-over-a-value-in-the-range-function-in-python), with perhaps `itertools.chain` looking the most promising, depending on the distribution of your skip values. – Yazeed Alnumay Apr 20 '23 at 10:20