12

I want to generate a random distribution of say 10,000 numbers with predefined min, max, mean, and sd values. I have followed this link setting upper and lower limits in rnorm to get random distribution with fixed min and max values. However, in doing so, mean value changes.

For example,

#Function to generate values between a lower limit and an upper limit.
mysamp <- function(n, m, s, lwr, upr, nnorm) {
set.seed(1)
samp <- rnorm(nnorm, m, s)
samp <- samp[samp >= lwr & samp <= upr]
if (length(samp) >= n) {
return(sample(samp, n))
}  
stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
} 
Account_Value <- mysamp(n=10000, m=1250000, s=4500000, lwr=50000, upr=5000000, nnorm=1000000)
summary(Account_Value)

# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 50060 1231000 2334000 2410000 3582000 5000000
#Note - though min and max values are good, mean value is very skewed for an obvious reason.
# sd(Account_Value) # 1397349

I am not sure whether we can generate a random normal distribution that meets all conditions. If there is any other sort of random distribution that can meet all conditions, please do share too.

Look forward to your inputs.

-Thank you.

Community
  • 1
  • 1
skumar
  • 353
  • 2
  • 4
  • 12
  • You might have a look at `help("distribution")`. At the bottom, there's a link to the CRAN task view on distributions which probably will provide a package that will get close to what you are looking for. – lmo Jun 12 '16 at 12:40
  • @Imo - Thank you for this suggestion. I believe you are talking about https://cran.r-project.org/web/views/Distributions.html. – skumar Jun 12 '16 at 18:24
  • Right. I probably should have just put up the link in my post. Thanks for adding it. – lmo Jun 12 '16 at 19:43

2 Answers2

9

You could use a generalized form of the beta distribution, known as the Pearson type I distribution. The standard beta distribution is defined on the interval (0,1), but you can take a linear transformation of a standard beta distributed variable to obtain values between any (min, max). The answer to this question on CrossValidated explains how to parameterize a beta distribution with its mean and variance, with certain constraints.

While it's possible to formulate both a truncated normal and a generalized beta distribution with the desired min, max, mean and sd, the shape of the two distributions will be very different. This is because the truncated normal distribution has a positive probability density at the endpoints of its support interval, while in the generalized beta distribution the density will always fall smoothly to zero at the endpoints. Which shape is more preferable will depend on your intended application.

Here's an implementation in R for generating generalized beta distributed observations with a mean, variance, min and max parameterization.

rgbeta <- function(n, mean, var, min = 0, max = 1)
{
  dmin <- mean - min
  dmax <- max - mean

  if (dmin <= 0 || dmax <= 0)
  {
    stop(paste("mean must be between min =", min, "and max =", max)) 
  }

  if (var >= dmin * dmax)
  {
    stop(paste("var must be less than (mean - min) * (max - mean) =", dmin * dmax))
  }

  # mean and variance of the standard beta distributed variable
  mx <- (mean - min) / (max - min)
  vx <- var / (max - min)^2

  # find the corresponding alpha-beta parameterization
  a <- ((1 - mx) / vx - 1 / mx) * mx^2
  b <- a * (1 / mx - 1)

  # generate standard beta observations and transform
  x <- rbeta(n, a, b)
  y <- (max - min) * x + min

  return(y)
}

set.seed(1)

n <- 10000
y <- rgbeta(n, mean = 1, var = 4, min = -4, max = 5)

sapply(list(mean, sd, min, max), function(f) f(y))
#    [1]  0.9921269  2.0154131 -3.8653859  4.9838290
Community
  • 1
  • 1
Mikko Marttila
  • 10,972
  • 18
  • 31
  • Thank you for sharing this descriptive reply. This is really useful. – skumar Jun 15 '16 at 07:03
  • For me this code gives an error "Error in f(y) : could not find function "f"" Any updates on this? – Erdne Htábrob Feb 02 '22 at 20:00
  • 1
    @Erdne I can still run this without issue on my end. As a first step I'd recommend trying again in a fresh session -- if the issue persists, try opening a new question with more details. That said, the part of this code that that error is coming from is just computing the sample statistics for the generated data to verify that they match what was expected -- it's not essential for the solution itself. – Mikko Marttila Feb 02 '22 at 22:35
5

Discussion:

Hi. It is very interesting problem. It needs quite an effort to be solved properly and not always solution can be found.

First thing is that when you truncate a distribution (set a min and max for it) standard deviation is limited (has a maximum depending on min and max values). If you want too big value of it - you can not get it.

Second restriction limits mean. It is obvious that if you want mean below minimum and above maximum it will not work, but you may want something too close to limits and still it can not be satisfied.

Third restriction limits a combination of this parameters. Im not sure how does it work, but i am pretty sure not all the combinations may be satisfied.

But there are some combinations that may work and may be found.

Solution:

The problem is: what are the parameters: mean and sd of truncated (cut) distribution with defined limits a and b, so in the end the mean will be equal to desired_mean and standard deviation will be equal to desired_sd.

It is important that values of parameters: mean and sd are used before truncation. So that is why in the end mean and deviation are diffrent.

Below is the code that solves the problem using function optim(). It may not be the best solution for this problem, but it generally works:

require(truncnorm)

eval_function <- function(mean_sd){
    mean <- mean_sd[1]
    sd <- mean_sd[2]
    sample <- rtruncnorm(n = n, a = a, b = b, mean = mean, sd = sd)
    mean_diff <-abs((desired_mean - mean(sample))/desired_mean)
    sd_diff <- abs((desired_sd - sd(sample))/desired_sd)
    mean_diff + sd_diff
}

n = 1000
a <- 1
b <- 6
desired_mean <- 3
desired_sd <- 1

set.seed(1)
o <- optim(c(desired_mean, desired_sd), eval_function)

new_n <- 10000
your_sample <- rtruncnorm(n = new_n, a = a, b = b, mean = o$par[1], sd = o$par[2])
mean(your_sample)
sd(your_sample)
min(your_sample)
max(your_sample)
eval_function(c(o$par[1], o$par[2]))

I am very interested if there are other solutions to that problem, so please post them if you find other answers.

EDIT:

@Mikko Marttila: Thanks to your comment and link: Wikipedia I implemented formulas to calculate mean and sd of truncated distribution. Now the solution is WAY more elegant and it should calculate quite accurately mean and sd of the desired distribution if they exist. It works much faster also.

I implemented eval_function2 which should be used in the optim() function instead of previous one:

eval_function2 <- function(mean_sd){
    mean <- mean_sd[1]
    sd <- mean_sd[2]

    alpha <- (a - mean)/sd
    betta <- (b - mean)/sd

    trunc_mean <- mean + sd * (dnorm(alpha, 0, 1) - dnorm(betta, 0, 1)) / 
                  (pnorm(betta, 0, 1) - pnorm(alpha, 0, 1))

    trunc_var <- (sd ^ 2) * 
                 (1 + 
                  (alpha * dnorm(alpha, 0, 1) - betta * dnorm(betta, 0, 1))/
                  (pnorm(betta, 0, 1) - pnorm(alpha, 0, 1)) -
                 (dnorm(alpha, 0, 1) - dnorm(betta, 0, 1))/
                 (pnorm(betta, 0, 1) - pnorm(alpha, 0, 1)))

    trunc_sd <- trunc_var ^ 0.5

    mean_diff <-abs((desired_mean - trunc_mean)/desired_mean)
    sd_diff <- abs((desired_sd - trunc_sd)/desired_sd)
}
Community
  • 1
  • 1
cure
  • 425
  • 2
  • 12
  • Thank you for sharing this detailed answer. I really appreciate. I have one follow up question - initially, you calculated mean and sd on a random distribution of 1,000 numbers. Then, you used modified mean and sd on this random distribution as parameters to calculate mean and sd of a random distribution of 10,000 numbers. Should not we have 10,000 numbers in both cases? Can you please also explain what exactly is optim function doing here? This is so because using eval_function(c(desired_mean, desired_sd)) gives only a number. I would surely post any other solution that I come across. – skumar Jun 12 '16 at 18:21
  • `optim()` function finds such parameters from vector `mean_sd`, that value of eval function is lowest possible. It is an analogy to typical optimisation of a mathematical function. The only difference is that our function is an R function. – cure Jun 12 '16 at 20:36
  • About the length of sample - function `optim()` makes iterations and it is hard to say how many. Every iteration it needs to evaluate `eval_function`, so 1000 is a reasonable choice for short time of evaluating and quite good precision. If you have better CPU you can use more, but remember that time will increase. Afterwards, when i have evaluated parameters, I can take bigger sample, as it is calculated only once. – cure Jun 12 '16 at 20:40
  • As i mentioned before, the `optim()` function is not a best choice, because as you can see `eval_function` is a stochastic function - results differ every time. Function `optim()` seems good function to optimise deterministic `eval_function`, but not stochastic. The results should be quite OK, but not the best. The quality of your result you can find yourself. Just check if the result satisfy you. – cure Jun 12 '16 at 20:48
  • 1
    Note that you don't need to sample to obtain the mean and standard deviation of the truncated normal distribution. See e.g. [wikipedia](https://en.wikipedia.org/wiki/Truncated_normal_distribution) for the formulas for calculating the mean and sd for a truncated distribution from the "original" mean and sd and min & max. – Mikko Marttila Jun 13 '16 at 00:47
  • Thank you @cure for your inputs. Your inputs make things slightly easier to understand. – skumar Jun 15 '16 at 07:05
  • Thank you @Mikko Marttila to you as well. – skumar Jun 15 '16 at 07:05