
What I tried was this:

import numpy as np

def test_random(nr_selections, n, prob):
    selected = np.random.choice(n, size=nr_selections, replace=False, p=prob)
    print(str(nr_selections) + ': ' + str(selected))

n = 100
prob = np.random.choice(100, n)
prob = prob / np.sum(prob)  # only for demonstration purposes
for i in np.arange(10, 100, 10):
    np.random.seed(123)
    test_random(i, n, prob)

The result was:

10: [68 32 25 54 72 45 96 67 49 40]
20: [68 32 25 54 72 45 96 67 49 40 36 74 46  7 21 20 53 65 89 77]
30: [68 32 25 54 72 45 96 67 49 40 36 74 46  7 21 20 53 62 86 60 35 37  8 48
     52 47 31 92 95 56] 
40: ...

Contrary to my expectation and hope, the 30 selected numbers do not contain all 20 numbers of the shorter selection. I also tried `numpy.random.default_rng`, but that only took me further from my desired output. (I have simplified my original problem somewhat in the above example.) Any help would be greatly appreciated. Thank you!

Edit for clarification: I do not want to generate all the sequences in one loop (as in the example above) but rather use the related sequences in different runs of the same program, ideally without storing them anywhere.

naxatras
  • Maybe just generate the longest one and then use prefixes of it? – Kelly Bundy Aug 25 '21 at 18:08 (see the first sketch after the comments)
  • Thank you for the suggestion. Could this cause problems, because all elements are chosen for the longest one? – naxatras Aug 25 '21 at 18:32
  • Why would that cause problems? What kind of problems are you thinking of? – Kelly Bundy Aug 25 '21 at 18:38
  • Well, because of the non-replacement, the selection probabilities for individual elements change constantly during the execution of `random.choice`. When I essentially pick all elements, i.e. `random.choice(100, 100, replace=False)`, are they still in the correct order (which depends on the original prob weights)? – naxatras Aug 25 '21 at 18:50
  • Ordinarily reseeding a random number generator will make it generate an identical sequence of numbers. But because you're picking from a list with probabilities, it introduces an aspect of unpredictability as your list size changes. I'm no statistician, but it's not obvious to me how to fix this. – Mark Ransom Aug 26 '21 at 01:20
  • @MarkRansom I don't think the probabilities are a problem. [This version](https://tio.run/##XZAxbsMwDEV3nYJAh8io4djNFsBrrhEoNuMIkCmBkgf38i6lJGhdDeLwyc/3Gdb08HTaNjsHzwlomcMKJgIFpUa8Q8KYrmxo9LMmvkZ0OCTrKdZANQT2t@qsQN5TwRF6YJqa4eHtgFp6ov3G/t8oY3BmwB4uxkUUH5kqXsUqsKWkY@L9xgo@4XCGg5SsvRdWlVIk813bqrtnsGBJ8Bsj1BPqrq2zlL8XqeBJu3Q8YzUS0yxOUtKku6/Tm8Hf9lGKCe3UUo7ZKi6zLgHgw5NbIYOMOAt1YpPpISwcfMQy/veo9veQ2/YD) (doesn't run there because old numpy) gets me the same numbers every time I run the script. The thing is that the same-seeded RNG gives you the same results if you request the same things, but ... – Kelly Bundy Aug 26 '21 at 03:43
  • ... requesting different lengths is requesting different things. – Kelly Bundy Aug 26 '21 at 03:43
  • @naxatras Ok, not sure about that. How the probabilities are even used. If you sample just 10%, surely the high-probability values have a higher probability to make it into the sample at all, but do their probabilities affect how *early* they make it, i.e., their order? I don't see the documentation talk about it. If it's just about whether they make it into the sample at all, then when sampling 100%, the probabilities are completely irrelevant. – Kelly Bundy Aug 26 '21 at 03:58
  • Do you care about order inside each of the ten blocks of ten values? If you do, I'm not sure `choices` is the right tool (see my previous comment). If you don't, i.e., you just care about which numbers are in a block, then for example for getting 40 values, you could call `choices` four times, requesting 10 values each, updating the remaining available values and their probabilities yourself. – Kelly Bundy Aug 26 '21 at 04:03 (see the block-wise sketch after the comments)
  • I only care about the blocks, not the order within the blocks. I tried your suggestion of requesting 10 values each and it worked at first. Thank you ! However, when I upped the number of trials and tried different numbers of values (12, 15, 20 values each), it led to weird results. The results matched about 90% of the time for 10 and 20 values each, 100% of the time for 12 values each, and 60% of the time for 15 values each. (100,000 trial runs). Argh! Any ideas? – naxatras Aug 26 '21 at 09:14
  • I also generated a `random.choice(100, 100, replace=False, p=prob)` with different seeds 1,000,000 times, and the frequency of the first element in the resulting arrays matches the original selection probability. This should indicate that the construction of the array in random.choice works as intended, I think...? – naxatras Aug 26 '21 at 10:44 (a rough re-creation of this check appears after the comments)
  • I would report this as a bug in the numpy bug tracker if I were in your position. I assume the underlying algorithm should be something like this: https://stackoverflow.com/questions/57599509/c-random-non-repeated-integers-with-weights , which should not produce these differences, and there might be some underlying bug affecting other use cases too. The only situation in which I could see something like this happening is if they are somehow scaling the weights according to the sample size in order to have more or less numerical precision, but I doubt that's the case. – anymous.asker Aug 26 '21 at 23:10 (see the keyed-sampling sketch after the comments)
  • Why didn't the method suggested by @KellyBundy work? I.e., when you generate 10 elements, then remove the generated ones from the list. You mentioned it didn't work when you generated the values 15 at a time. Can you please share the code? – DanielTuzes Oct 16 '21 at 20:26
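
A minimal sketch of the prefix idea from the first comment, assuming one long draw per run is acceptable. The seed 123 is taken from the question; the demo weights are hypothetical and drawn strictly positive, because `np.random.choice` with `replace=False` can fail when `p` contains zeros and the requested size is large:

import numpy as np

n = 100
np.random.seed(123)                # seed from the question
prob = np.random.random(n)         # strictly positive demo weights
prob = prob / prob.sum()

# Draw the longest sequence once per run, then slice prefixes of it.
np.random.seed(123)
longest = np.random.choice(n, size=n, replace=False, p=prob)

for k in np.arange(10, 100, 10):
    print(str(k) + ': ' + str(longest[:k]))   # 30-prefix contains the 20-prefix

By construction, every shorter selection is a prefix of every longer one; the open question debated in the comments is whether the order inside `longest` still reflects the weights the way sequential weighted draws would.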
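
A sketch of the block-wise idea from the comments: request a fixed block size repeatedly, and between calls remove the already-chosen values and renormalize the remaining weights manually. The function name `sample_in_blocks` is hypothetical. Because each run repeats exactly the same calls on a fresh, same-seeded generator, the first m blocks come out identical no matter how many blocks are requested in total, but only while the block size stays fixed; mixing block sizes (10 vs. 15 values each) consumes the generator's stream differently, which may explain the mismatches reported in the comments above:

import numpy as np

def sample_in_blocks(n, prob, block_size, n_blocks, seed=123):
    rng = np.random.default_rng(seed)   # fresh generator every run
    remaining = np.arange(n)
    weights = np.asarray(prob, dtype=float)
    blocks = []
    for _ in range(n_blocks):
        w = weights[remaining]
        w = w / w.sum()                 # renormalize over what is left
        block = rng.choice(remaining, size=block_size, replace=False, p=w)
        blocks.append(block)
        remaining = np.setdiff1d(remaining, block)  # drop chosen values
    return blocks

n = 100
prob = np.random.default_rng(0).random(n)   # hypothetical demo weights
prob = prob / prob.sum()

# The first three blocks of the 4-block run equal the 3-block run.
print(sample_in_blocks(n, prob, block_size=10, n_blocks=3))
print(sample_in_blocks(n, prob, block_size=10, n_blocks=4))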
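
A rough re-creation of the frequency check described in the comment above, under hypothetical demo weights and a smaller trial count: draw a full weighted permutation per seed and compare how often each value lands in the first position against its weight.

import numpy as np

n = 100
prob = np.random.default_rng(0).random(n)   # hypothetical demo weights
prob = prob / prob.sum()

trials = 10_000
counts = np.zeros(n)
for seed in range(trials):
    rng = np.random.default_rng(seed)
    perm = rng.choice(n, size=n, replace=False, p=prob)
    counts[perm[0]] += 1

# Largest gap between empirical first-position frequency and weight;
# it should shrink as the trial count grows.
print(np.max(np.abs(counts / trials - prob)))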
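
The algorithm behind the question linked in the comments appears to be weighted sampling via independent random keys (Efraimidis & Spirakis). A minimal sketch, assuming strictly positive weights: give item i the key Exp(1)/w_i and sort ascending. The k smallest keys form a size-k weighted sample without replacement, and every prefix of the full sorted order is such a sample, which is exactly the nesting asked for. The name `weighted_order` is hypothetical:

import numpy as np

def weighted_order(weights, seed=123):
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    keys = rng.exponential(size=weights.size) / weights   # Exp(1) / w_i
    return np.argsort(keys)    # ascending: smallest key is drawn first

n = 100
prob = np.random.default_rng(0).random(n)   # hypothetical demo weights
prob = prob / prob.sum()

order = weighted_order(prob)
for k in np.arange(10, 100, 10):
    print(str(k) + ': ' + str(order[:k]))   # each prefix nests the previous

Since the keys depend only on the seed and the weights, re-running the program reproduces the same order without storing anything, which matches the clarification in the question.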

0 Answers