1

I had an application that required something similar to the problem described here.

I too need to generate a set of positive integer random variables {Xi} that add up to a given sum S, where each variable might have constraints such as mi<=Xi<=Mi.

This I know how to do, the problem is that in my case I also might have constraints between the random variables themselves, say Xi<=Fi(Xj) for some given Fi (also lets say Fi's inverse is known), Now, how should one generate the random variables "correctly"? I put correctly in quotes here because I'm not really sure what it would mean here except that I want the generated numbers to cover all possible cases with as uniform a probability as possible for each possible case.

Say we even look at a very simple case: 4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, what would be the "correct" way to generate them?

P.S. I know that this seems like it would be a better fit for math overflow but I found no solutions there either.

Veltzer Doron
  • 934
  • 2
  • 10
  • 31
  • 1
    Integer random variables? – Severin Pappadeux Dec 13 '20 at 16:17
  • 1
    Have you considered simply drawing random variables in the given ranges, and rejecting the sets that don't satisfy the constraints? – Beta Dec 13 '20 at 16:26
  • 2 considerations come to mind: first, you just need to draw 3 variables, the third is 100-(x1 + x2 + x3), second, if you start by drawing the second variable, you can draw the first in the range [0, 2*x2]. Clearly you would have to check a bunch of ranges before you get one that works, but I guess less than just drawing the variables completely at random. A good thing about this approach is that you can use different distributions for the variables. – gionni Dec 13 '20 at 17:21
  • 1
    @gionni: No, that would give you a higher density at low x2. – Beta Dec 13 '20 at 17:38
  • @Beta: i guess you are referring to my second observation, but I don't see why, although it seems reasonable for some reason :P – gionni Dec 13 '20 at 18:01
  • @gionni: read the other question I linked to or better yer, this answer https://stackoverflow.com/a/8068956/374437 – Veltzer Doron Dec 13 '20 at 21:06
  • 1
    Yes, for the constraint `x1<=x2` you can swap them *if their ranges are the same* and not waste any draws. You can use a similar trick for the constraint `x1<=2*x2` if the range of x1 is twice as wide as the range of x2. A way to avoid wasting draws *in general* is not obvious (at least to me). – Beta Dec 14 '20 at 02:16

3 Answers3

2

For 4 random variables X1,X2,X3,X4 that need to add up to 100 and comply with the constraint X1 <= 2*X2, one could use multinomial distribution

As soon as probability of the first number is low enough, your condition would be almost always satisfied, if not - reject and repeat. And multinomial distribution by design has the sum equal to 100.

Code, Windows 10 x64, Python 3.8

import numpy as np

def x1x2x3x4(rng):
    while True:
        v = rng.multinomial(100, [0.1, 1/2-0.1, 1/4, 1/4])
        if v[0] <= 2*v[1]:
            return v

    return None

rng = np.random.default_rng()

print(x1x2x3x4(rng))
print(x1x2x3x4(rng))
print(x1x2x3x4(rng))

UPDATE

Lots of freedom in selecting probabilities. E.g., you could make other (##2, 3, 4) symmetric. Code

def x1x2x3x4(rng, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(100, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v

    return None

UPDATE II

If you start rejecting combinations, then you artificially bump probabilities of one subset of events and lower probabilities of another set of events - and total sum is always 1. There is NO WAY to have uniform probabilities with conditions you want to meet. Code below runs with multinomial with equal probabilities and computes histograms and mean values. Mean supposed to be exactly 25 (=100/4), but as soon as you reject some samples, you lower mean of first value and increase mean of the second value. Difference is small, but UNAVOIDABLE. If it is ok with you, so be it. Code

import numpy as np
import matplotlib.pyplot as plt

def x1x2x3x4(rng, summa, pfirst = 0.1):
    pother = (1.0 - pfirst)/3.0
    while True:
        v = rng.multinomial(summa, [pfirst, pother, pother, pother])
        if v[0] <= 2*v[1]:
            return v
    return None

rng = np.random.default_rng()

s = 100
N = 5000000

# histograms
first = np.zeros(s+1)
secnd = np.zeros(s+1)
third = np.zeros(s+1)
forth = np.zeros(s+1)

mfirst = np.float64(0.0)
msecnd = np.float64(0.0)
mthird = np.float64(0.0)
mforth = np.float64(0.0)

for _ in range(0, N): # sampling with equal probabilities
    v = x1x2x3x4(rng, s, 0.25)

    q = v[0]
    mfirst   += np.float64(q)
    first[q] += 1.0

    q = v[1]
    msecnd   += np.float64(q)
    secnd[q] += 1.0

    q = v[2]
    mthird   += np.float64(q)
    third[q] += 1.0

    q = v[3]
    mforth   += np.float64(q)
    forth[q] += 1.0

x = np.arange(0, s+1, dtype=np.int32)

fig, axs = plt.subplots(4)
axs[0].stem(x, first, markerfmt=' ')
axs[1].stem(x, secnd, markerfmt=' ')
axs[2].stem(x, third, markerfmt=' ')
axs[3].stem(x, forth, markerfmt=' ')
plt.show()

print((mfirst/N, msecnd/N, mthird/N, mforth/N))

prints

(24.9267492, 25.0858356, 24.9928602, 24.994555)

NB! As I said, first mean is lower and second is higher. Histograms are a little bit different as well

enter image description here

UPDATE III

Ok, Dirichlet, so be it. Lets compute mean values of your generator before and after the filter. Code

import numpy as np

def generate(n=10000):
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)

a = generate(1000000)

print("Original Dirichlet sample means")
print(a.shape)
print(np.mean((a[:, 0] * 100).astype(int)))
print(np.mean((a[:, 1] * 100).astype(int)))
print(np.mean((a[:, 2] * 100).astype(int)))

print("\nFiltered Dirichlet sample means")
q = (a[(a[:,0]<=2*a[:,1]) & (a[:,2]>0.35),:] * 100).astype(int)
print(q.shape)

print(np.mean(q[:, 0]))
print(np.mean(q[:, 1]))
print(np.mean(q[:, 2]))

I've got

Original Dirichlet sample means
(1000000, 3)
32.833758
32.791228
32.88054

Filtered Dirichlet sample means
(281428, 3)
13.912784086871243
28.36360987535
56.23109285501087

Do you see the difference? As soon as you apply any kind of filter, you alter the distribution. Nothing is uniform anymore

Severin Pappadeux
  • 18,636
  • 3
  • 38
  • 64
  • Multinomial is not uniform, I was looking for a uniform distribution, i.e. find a distribution that gives each possible valid tuple of Xi the same probability density. Also... Not sure why your probabilities aren't equal. – Veltzer Doron Dec 13 '20 at 23:52
  • 2
    @VeltzerDoron First condition (sum == 100) is easy to get with equal probabilities. second condition (X1 <= 2*X2) will make probabilities unequal. – Severin Pappadeux Dec 13 '20 at 23:56
  • Actually, I already checked and the solution from https://stackoverflow.com/a/8068956/374437 works in generating a uniform distribution if I filter the solutions that don't meet with the constraints. Only question I have left (for the general case since in my case I only have linear conditions between my variables and that isn't so bad when the generation is random) is if I can forego the resampling till the conditions are met in order to speed up the generation process. – Veltzer Doron Dec 14 '20 at 00:06
  • @VeltzerDoron If you filter the solutions (aka rejection method which is present in my code as well), you WON'T get uniform distribution, whatever you try - multinomial for integers or Dirichlet for floats. I put an update, please check it. – Severin Pappadeux Dec 14 '20 at 00:20
  • I don't mean uniform on the whole simplex, I mean uniform on the valid values, and I've plotted a distribution histogram, it is uniform. And I'm still not getting the reason why you are giving the first random variable a lower probability. If it was done to let the loop finish faster than it is much faster if the distribution you limit is uniform to begin with (it is about 3 times as slow with these constraints) I'll write my code in the answers. I only have one question now, can the filter be made redundant? – Veltzer Doron Dec 14 '20 at 00:26
  • @VeltzerDoron No, they are not, I just plotted histograms and mean values as well. It just cannot be – Severin Pappadeux Dec 14 '20 at 00:27
0

Ok, so I have this solution for my actual question where I generate 9000 triplets of 3 random variables by joining zeros to sorted random tuple arrays and finally ones and then taking their differences as suggested in the answer on SO I mentioned in my original question.

Then I simply filter out the ones that don't match my constraints and plot them.

S = 100

def generate(n=9000):
    uv = np.hstack([np.zeros([n, 1]),
                    np.sort(np.random.rand(n, 2), axis=1),
                    np.ones([n,1])])
    return np.diff(uv, axis=1)

a = generate()

def plotter(a):
    fig = plt.figure(figsize=(10, 10), dpi=100)
    ax = fig.add_subplot(projection='3d')

    surf = ax.scatter(*zip(*a), marker='o', color=a / 100)
    ax.view_init(elev=25., azim=75)
    
    ax.set_xlabel('$A_1$', fontsize='large', fontweight='bold')
    ax.set_ylabel('$A_2$', fontsize='large', fontweight='bold')
    ax.set_zlabel('$A_3$', fontsize='large', fontweight='bold')
    lim = (0, S);
    ax.set_xlim3d(*lim);
    ax.set_ylim3d(*lim);
    ax.set_zlim3d(*lim)
    plt.show()

b = a[(a[:, 0] <= 3.5 * a[:, 1] + 2 * a[:, 2]) &\
      (a[:, 1] >= (a[:, 2])),:] * S
plotter(b.astype(int))

enter image description here

As you can see, the distribution is uniformly distributed over these arbitrary limits on the simplex but I'm still not sure if I could forego throwing away samples that don't adhere to the constraints (work the constraints somehow into the generation process? I'm almost certain now that it can't be done for general {Fi}). This could be useful in the general case if your constraints limit your sampled area to a very small subarea of the entire simplex (since resampling like this means that to sample from the constrained area a you need to sample from the simplex an order of 1/a times).

If someone has an answer to this last question I will be much obliged (will change the selected answer to his).

Veltzer Doron
  • 934
  • 2
  • 10
  • 31
  • 1
    I don't mind running Dirichlet, and put another update in my answer, please take a look at it. `As you can see the distribution is uniformly distributed over these arbitrary limits of the simplex` - no, it is NOT. You could get alongside the mean values std.deviation, higher momenta which all will tell you those numbers are distributed differently. – Severin Pappadeux Dec 14 '20 at 01:44
  • Uniformly doesn't mean that the variables have the same distribution as each other (obviously that is impossible if one has to be more than the other, in my case they don't even have the same range as in the plot), it means that each legal point in the distribution has the same probabilistic weight/density. – Veltzer Doron Dec 14 '20 at 06:30
  • 1
    This is simplex we're talking about. Uniform here means uniform density in the simplex. And simplex is complete symmetric wrt axes. It means that X,Y,Z (in 3D, same in higher D) have THE SAME MARGINAL DISTRIBUTIONS, same mean, same momenta etc. You could just rename the axes and outcome supposed to be the same. Of you could reshuffle output with the same outcome. But as soon as you discovered that X, Y and Z have different mean and other momenta, uniformity in the simplex goes out of the window. You do not have uniform in simplex points anymore. – Severin Pappadeux Dec 14 '20 at 14:04
  • The idea was to sample the constrained space evenly so as to make sure all parts of it are covered. – Veltzer Doron Dec 15 '20 at 04:09
0

I have an answer to my question, under a general set of constraints what I do is:

  • Sample the constraints in order to evaluate s, the constrained area.
  • If s is big enough then generate random samples and throw out those that do not comply to the constraints as described in my previous answer.
  • Otherwise:
    1. Enumerate the entire simplex.
    2. Apply the constraints to filter out all tuples outside the constrained area.
    3. List the resulting filtered tuples.
    4. When asked to generate, I generate by choosing uniformly from this result list. (note: this is worth my effort only because I'm asked to generate very often)
  • A combination of these two strategies should cover most cases.

Note: I also had to handle cases where S was a randomly generated parameter (m < S < M) in which case I simply treat it as another random variable constrained between m and M and I generate it together with the rest of the variables and handle it as I described earlier.

Veltzer Doron
  • 934
  • 2
  • 10
  • 31