
I want to generate two sequences S and T of (pseudo) random normally distributed numbers with the following constraints:

  1. S has mean m_S, standard deviation sd_S, and length N
  2. T has mean m_T, standard deviation sd_T, and length N
  3. for each i, T(i) <= S(i), i.e. an element at position i in the sequence T is smaller than or equal the corresponding element in position i in sequence S

Ideally, I would like to generate the two sequences incrementally (one pair of numbers at a time).

I know how to generate S or T independently (for example, in Java: `mean + stdev * ThreadLocalRandom.current().nextGaussian();` -- see also this question), but I don't know how to generate S and T satisfying the third constraint.

Any help is appreciated!

MarcoS
  • Outside of constraint 3 which is clearly absolute, must the results have exactly those properties, or statistically appropriate approximations of those properties? – sh1 Mar 07 '17 at 22:46
  • @sh1 statistically appropriate approximations of those properties are OK. – MarcoS Mar 08 '17 at 09:34

2 Answers


As worded, this is not mathematically possible. Let D = S - T. Your constraint that T <= S means S - T = D >= 0. If S and T are (jointly) normally distributed, so is D, because linear combinations of normals are normal. But you can't have a normal distribution with a finite lower bound. Consequently, you can't simultaneously meet the requirements of normality for S and T and meet your constraint.

You can construct non-normal solutions by generating T with any distribution of your choice, generating D independently with a strictly positive distribution (such as gamma, Weibull, uniform(0,?), truncated normal,...), and creating S = T + D. Since expectations and variances of independent random variables sum, you can get your desired mean and s.d. for both T and S by tuning the parameterization of D appropriately. The results can even look pretty bell-shaped, but strictly speaking won't be normal.

Since variances of independent random variables are additive and must be positive, S = T + D only works if the variance of S is larger than the variance of T. The more general solution is to generate whichever of S and T has the smaller variance. If it's T, add D to get S. If it's S, subtract D to get T.
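The general construction described above can be sketched as follows. The function name and parameterization here are mine, not from the original answer; it assumes μ_S > μ_T and σ_S ≠ σ_T, and uses a gamma-distributed D as in the worked example below:

```python
import random

def constrained_pair(mu_t, sd_t, mu_s, sd_s):
    """One (t, s) pair with t <= s and approximately the desired
    marginal means/sds. Assumes mu_s > mu_t and sd_s != sd_t.
    The wider of the two marginals is a gamma-shifted normal, so it
    is only roughly bell-shaped, not truly normal."""
    mu_d = mu_s - mu_t                    # E[D] = mu_S - mu_T, must be > 0
    var_d = abs(sd_s ** 2 - sd_t ** 2)    # |Var[S] - Var[T]|
    alpha = mu_d ** 2 / var_d             # gamma shape = mean^2 / variance
    beta = var_d / mu_d                   # gamma scale = variance / mean
    d = random.gammavariate(alpha, beta)  # strictly positive difference
    if sd_s >= sd_t:
        t = random.gauss(mu_t, sd_t)      # generate the narrower one directly
        return t, t + d
    else:
        s = random.gauss(mu_s, sd_s)
        return s - d, s

pairs = [constrained_pair(10, 3, 15, 5) for _ in range(100_000)]
print(all(t <= s for t, s in pairs))  # True: constraint 3 holds by construction
```

Either branch yields the same marginal moments: in the subtraction case, Var[T] = Var[S] + Var[D] = sd_s² + (sd_t² - sd_s²) = sd_t², as required.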

Since you said in comments that approximations are okay, here's an example. Suppose you want the smaller distribution to have μ_smaller = 10 and σ_smaller = 3, and the larger to have μ_larger = 15 and σ_larger = 5. Then the difference between them should have a strictly positive distribution with μ_delta = 5 and σ_delta = 4 (since σ_larger = sqrt(3² + 4²) = 5). I chose a gamma distribution for delta, parameterized to have the desired mean and standard deviation. Here it is in Python:

import random

# Gamma parameterized so that mean = alpha * beta = 5 and
# variance = alpha * beta**2 = 16, i.e. sd = 4
alpha = 25.0 / 16.0
beta = 16.0 / 5.0
for _ in range(100000):
    smaller = random.gauss(10, 3)             # normal, mean 10, sd 3
    delta = random.gammavariate(alpha, beta)  # strictly positive
    print(smaller, delta, smaller + delta)    # last column is "larger"

I saved the results to a file, and imported them into JMP. Here's a snapshot of my analysis:

[Image: descriptive statistics of the Python program's output]

As you can see, smaller and larger have the desired means and standard deviations. You can also confirm that all of the deltas are positive, so larger is always greater than smaller. Finally, the normal q-q plot above larger's histogram shows that the result, while unimodal and roughly bell-shaped, is not normal because the plotted points don't fall along a straight line.
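If you don't have JMP handy, a rough version of the same summary can be checked directly in Python with the standard library (the seed is arbitrary, chosen only to make the run reproducible):

```python
import random
import statistics

random.seed(12345)  # arbitrary seed for reproducibility
alpha, beta = 25.0 / 16.0, 16.0 / 5.0
smaller = [random.gauss(10, 3) for _ in range(100_000)]
delta = [random.gammavariate(alpha, beta) for _ in range(100_000)]
larger = [s + d for s, d in zip(smaller, delta)]

print(statistics.mean(smaller), statistics.stdev(smaller))  # ~10, ~3
print(statistics.mean(larger), statistics.stdev(larger))    # ~15, ~5
print(min(delta) > 0)  # True: every "larger" exceeds its "smaller" partner
```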


Another answer has proposed matching the two distributions by generating a single random uniform and using it as the input for inversion with both CDFs:

q = random()
t = inverseCDF(q, mu_T, sd_T)
s = inverseCDF(q, mu_S, sd_S)

This is a well-known correlation induction strategy called "Common Random Numbers" in which q is the same quantile being used to generate both distributions via inversion. With symmetric distributions, such as the normal, this produces a correlation of 1 between T and S. A correlation of 1 tells us that (either) one is a linear transformation of the other.
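The perfect correlation is easy to verify numerically. This sketch uses `statistics.NormalDist` (Python 3.8+) for the inversions and computes the Pearson correlation by hand (the `statistics.correlation` helper only exists in 3.10+); the equal-variance case is deliberately chosen so that T <= S also holds:

```python
import random
import statistics
from statistics import NormalDist  # Python 3.8+

random.seed(42)  # arbitrary seed, just for reproducibility
T = NormalDist(mu=0, sigma=1)
S = NormalDist(mu=1, sigma=1)  # equal sigmas: here T <= S at every quantile

qs = [random.random() for _ in range(10_000)]
ts = [T.inv_cdf(q) for q in qs]  # same quantile q drives both inversions
ss = [S.inv_cdf(q) for q in qs]

# Pearson correlation, computed by hand
mt, ms = statistics.mean(ts), statistics.mean(ss)
cov = sum((t - mt) * (s - ms) for t, s in zip(ts, ss)) / len(ts)
corr = cov / (statistics.pstdev(ts) * statistics.pstdev(ss))
print(round(corr, 9))  # 1.0: each s is a linear transformation of its t
```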

In fact, there's a simpler way to accomplish this for normals without having to do two inversions. Generate one of T or S, by whatever mechanism you wish—inversion, polar method, Ziggurat method, etc. Then use the standard transformation to convert it to a standard normal, and from there to the other normal. If we let T be the normal that we generate directly, then

S = (σ_S / σ_T) * (T - μ_T) + μ_S.

We would like to have T <= S for all possible quantiles to meet the objectives of the original problem. So under what circumstances can we have S < T? Since S is a function of T, that implies

(σ_S / σ_T) * (T - μ_T) + μ_S < T

which after some algebra becomes

T * (σ_S - σ_T) / σ_S < μ_T - μ_S * (σ_T / σ_S).

This reduces to 3 cases.

  1. σ_S = σ_T: In this case, T gets eliminated and the originally desired outcome of T <= S is achieved as long as μ_T <= μ_S.

  2. σ_S > σ_T: In this case, T > S when T < (μ_T * σ_S / (σ_S - σ_T)) - (μ_S * σ_T / (σ_S - σ_T)).

  3. σ_S < σ_T: T > S when T > (μ_T * σ_S / (σ_S - σ_T)) - (μ_S * σ_T / (σ_S - σ_T)) because of the sign flip induced in the result in #2 by having (σ_S - σ_T) < 0.

Bottom line - the only case in which the correlation induction scheme works is when the variances of the two distributions are equal. Unequal variances will result in outcomes where T > S.
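This failure is easy to demonstrate with the counter-example from the comments (T standard normal, S with mean 1 and sd 2). A sketch using `statistics.NormalDist` (Python 3.8+); the seed is arbitrary:

```python
import random
from statistics import NormalDist  # Python 3.8+

random.seed(7)  # arbitrary seed for reproducibility
T = NormalDist(mu=0, sigma=1)
S = NormalDist(mu=1, sigma=2)  # unequal sigmas: constraint can fail

n = 100_000
violations = 0
for _ in range(n):
    q = random.random()
    if S.inv_cdf(q) < T.inv_cdf(q):  # same quantile, both distributions
        violations += 1

# Case 2 above predicts s < t exactly when t < (0*2 - 1*1)/(2 - 1) = -1,
# so the violation rate should be near P(T < -1), about 0.159
print(violations / n)
```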

The following picture may give some intuition. The red curve is a standard normal with mean 0 and standard deviation 1. The green curve is a normal with mean 1 and standard deviation 2. We can see that because the green curve is wider, there is some quantile below which it produces smaller outcomes than the red. If T, the "lower" distribution, had the larger variability there would be some quantile above which it would produce larger outcomes.

[Plot: the red standard normal density and the wider green normal(1, 2) density, crossing in the left tail]

pjs
  • I think one might be able to generate such numbers, please check my answer – Severin Pappadeux Mar 08 '17 at 02:24
  • @SeverinPappadeux I think I've given concrete reasons why you're wrong. – pjs Mar 08 '17 at 02:58
  • Sorry, but I'm right. Simplest case - sd is 1 for both distributions, mu_T=0, mu_S=1. The CDFs are basically just shifted error functions (.5+.5*erf(x/sqrt(2)) and .5+.5*erf((x-1)/sqrt(2))), the difference between the CDFs is ALWAYS positive (some integral of a gaussian), and my application of the inverse CDF will ALWAYS produce a pair with S > T. And if you run marginal tests on S alone it would be gaussian. And T alone would be gaussian. They won't be independent though, but IT IS NOT A STATED CONDITION. – Severin Pappadeux Mar 08 '17 at 03:34
  • @SeverinPappadeux More later when I get a chance to write it up (I'm preparing today's lecture), but your approach makes the two results perfectly correlated, i.e., a linear transformation of each other. As a result, it ***only*** works when the variances are equal, as in your "simplest case" example. Consider as a counter-example to your example the case where T is a standard normal, while S has a mean of 1 and a standard deviation of 2. It's easy to find values of S which are smaller than the corresponding quantile of T. – pjs Mar 08 '17 at 17:04
  • @SeverinPappadeux As promised, I've added a further explanation of why your approach is not a general solution. – pjs Mar 09 '17 at 00:49

I think you can calculate such sequences, satisfying all conditions and generating the two sequences incrementally (one pair of numbers at a time). To do that, you have to build the cumulative distribution functions CDF_S(x, m_S, sd_S) and CDF_T(x, m_T, sd_T) and be sure that for the given parameters and for ANY x their difference is non-negative: CDF_T(x) - CDF_S(x) >= 0 (so that the same quantile always maps to a smaller or equal value under T).

Then you draw from U(0,1), and for each uniformly distributed random number you calculate two Gaussians using the inverse CDF method (in some pseudocode):

r = random()
t = inverseCDF(r, mu_T, sd_T)
s = inverseCDF(r, mu_S, sd_S)

They will come as a pair, each guaranteed to be Gaussian, and satisfy conditions 1, 2, and 3.

The inverse CDF is expressed via the inverse error function, so depending on the libraries you're using you may already have it. If not, Wikipedia gives an approximation for erf⁻¹(x) which could be used: https://en.wikipedia.org/wiki/Error_function
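For example, in Python 3.8+ the standard library already exposes the inverse CDF as `statistics.NormalDist.inv_cdf` (in Java, a library such as Apache Commons Math provides the equivalent), so the pseudocode above becomes concrete. Note the equal standard deviations here, which is the case where the CDF condition holds:

```python
import random
from statistics import NormalDist  # Python 3.8+

mu_T, sd_T = 0.0, 1.0
mu_S, sd_S = 1.0, 1.0  # equal sds, so CDF_T(x) >= CDF_S(x) for all x

r = random.random()
t = NormalDist(mu_T, sd_T).inv_cdf(r)  # same quantile r drives
s = NormalDist(mu_S, sd_S).inv_cdf(r)  # both inversions
assert t <= s
print(t, s)
```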

UPDATE

To make it more clear: if you do as I proposed and generate, say, a data frame with N records and 2 columns (S, T), then clearly the S column taken separately would be Gaussian(m_S, sd_S), and the T column would be Gaussian(m_T, sd_T). No question about it. They won't be independent, but you didn't state such a condition. If that is OK, then this is the solution. If they have to be independent, it's impossible. So you have to make up your mind about what exactly you want.

Severin Pappadeux