4

I have a dataframe where I want to create random numbers in a new column. The random numbers must fulfill two constraints:

  1. The random numbers must add up to a specified sum (in the example, the sum is 300)
  2. For each observation, the random numbers must not exceed a value in the constraint column.

In the example below, the constraints are fulfilled because the sum is 300 and the random number does not exceed the constraint column.

Example:

GEOID CONSTRAINT RANDOM
010010000001 100 80
010010000002 50 40
010010000003 75 60
010010000004 75 60
010010000005 100 60

It seems having random numbers totaling a sum has been demonstrated but I do not see an example with a second constraint.

Edit for clarity: The new column must be integers. The minimum lower bound value is 0.

adin
  • 783
  • 3
  • 13
  • 27
  • 1
    What happens if the numbers cannot add up to sum due to constraints? – Dani Mesejo Sep 21 '21 at 15:37
  • @MichaelSzczesny are you suggesting floats instead? I suppose that would work if the floats could be transformed into integers with and not violate either of the constraints. – adin Sep 21 '21 at 16:03
  • @DaniMesejo then the attempt is not a solution. – adin Sep 21 '21 at 16:03
  • 2
    I expressed myself poorly. Let's try with an example: if all constraint equals 10 and the sum is 300 then there is no solution for your problem – Dani Mesejo Sep 21 '21 at 16:07
  • @adin So you're confirming to Michael that the outcomes have to be integers? It's a yes/no question. – pjs Sep 21 '21 at 16:21
  • 1
    @MichaelSzczesny I think so too, but by not actually saying "yes" the response remained ambiguous so I'd like to pin it down. There's also been no discussion of whether there's a lower bound. Can values go negative? – pjs Sep 21 '21 at 16:25
  • 3
    I think something like: `res = rng.multinomial(300, df["CONSTRAINT"] / df["CONSTRAINT"].sum(), size=1)` may work well for integers – Dani Mesejo Sep 21 '21 at 16:28
  • Does this answer your question? [Is there an efficient way to generate N random integers in a range that have a given sum or average?](https://stackoverflow.com/questions/61393463/is-there-an-efficient-way-to-generate-n-random-integers-in-a-range-that-have-a-g) – Peter O. Sep 21 '21 at 16:34
  • In particular, the first edit to this answer: https://stackoverflow.com/a/61525097/815724 – Peter O. Sep 21 '21 at 16:35
  • 1
    @adin According to your current wording, solutions [100, -50, 75, 75, 100], [100, 50, -25, 75, 100], or [100, 50, 75, -25, 100] would all meet your requirements. Is that correct, or are there lower bounds as well? – pjs Sep 21 '21 at 16:36
  • 2
    @DaniMesejo - That's brilliant, I only get 2% violations of the second constraint. That's very well suited for rejection sampling with 98% success on first try. – Michael Szczesny Sep 21 '21 at 16:39
  • @pjs thank you for bringing that up, I edited the original question for clarity that the lower bound is zero. – adin Sep 21 '21 at 16:49
  • 3
    @DaniMesejo can you please put `res = rng.multinomial(300, df["CONSTRAINT"] / df["CONSTRAINT"].sum(), size=1)` in a formal answer and explain how it works? I tried 100 groups of 1000 attempts using that answer and found that on average, I rejected 23% of the results because one of the results was greater than a constraint. Since I can check for valid attempts, and each iteration was quick, this solution works for me. – adin Sep 21 '21 at 19:17

1 Answers1

2

You could use the multinomial distribution to build an approximate answer:

def sample(total, constraints):
    import numpy as np
    rng = np.random.default_rng()
    samples = rng.multinomial(total, constraints / constraints.sum(), size=100)
    return next(val for val in samples if np.all(val < constraints))


df["RANDOM"] = sample(300, df["CONSTRAINT"].values)
print(df)

Output

             GEOID  CONSTRAINT  RANDOM
0  10010000001         100      81
1  10010000002          50      42
2  10010000003          75      57
3  10010000004          75      53
4  10010000005         100      67

Thanks goes to @Michael Szczesny for testing the solution.

The key to solve this, relies in (quote from numpy docs):

Its values, X_i = [X_0, X_1, ..., X_p], represent the number of times the outcome was i.

see more details in this blog post.

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76