5

I am working on my project about the income distribution... I would like to generate random data for testing the theory. Let say I have N=5 countries and each country has n=1000 population and i want to generate random income (NORMAL DISTRIBUTION) for each person in each population with the constraint of income is between 0 and 1 and at same mean and DIFFERENT standard deviation for all countries. I used the function rnorm(n, meanx, sd) to do it. I know that UNIFORM DISTRIBUTION (runif(n,min, max) has some arguments for setting min, max, but no rnorm. Since rnorm doesn't provide the argument for setting min and max value. I have to write a piece of code to check the set of random data to see whether they satisfy my constraints of [0,1] or not.

I successfully generated income data for n=100. However, if i increase n = k times of 100, for eg. n=200, 300 ......1000. My programme is hanging. I can see why the programs is hanging, since it just generate data randomly without constraints of min, max. Therefore, when I do with larger n, the probabilities that i will generate successfully is less than with n=100. And the loop just running again : generate data, failed check.

Technically speaking, to fix this problem, I think of breaking n=1000 into small batches, let say b=100. Since rnorm successfully generate with 100 samples in range [0,1] and it is NORMAL DISTRIBUTION, it will work well if i run the loop of 10 times of 100samples separately for each batch of 100 samples. And then, I will collect all data of 10 * 100 samples into one data of 1000 for my later analysis. However, mathematically speakign, I am NOT SURE whether the constrain of NORMAL DISTRIBUTION for n=1000 is still satisfied or not by doing this way. I attached here my code. Hopefully my explanation is clear to you. All of your opinions will be very useful to my work. Thanks a lot.

 # Update: 
# plot histogram
# create the random data with same mean, different standard deviation and x in range [0,1]

# Generate the output file 
# Generate data for K countries
#---------------------------------------------
# Configurable variables
number_of_populations = 5
n=100  #number of residents (*** input the number whish is k times of 100)
meanx = 0.7
sd_constant = 0.1 # sd = sd_constant + j/50

min=0 #min income
max=1 #max income

#---------------------------------------------
batch =100  # divide the large number of residents into small batch of 100

x= matrix(
  0,                           # the data elements 
  nrow=n,                       # number of rows 
  ncol=number_of_populations,   # number of columns 
  byrow = TRUE)                 # fill matrix by rows 

x_temp = rep(0,n)
# generate income data randomly for each country
for (j in 1:number_of_populations){
  # 1. Generate uniform distribution
  #x[,j] <- runif(n,min, max)
  # 2. Generate Normal distribution
  sd = sd_constant+j/50

  repeat
  {
{
  x_temp <- rnorm(n, meanx, sd)
  is_inside = TRUE
  for (i in 1:n){
    if (x_temp[i]<min || x_temp[i] >max) {
      is_inside = FALSE
      break
    }
  }
}   
if(is_inside==TRUE) {break}
  } #end repeat

  x[,j] <- x_temp

}


# write in csv
# each column stores different income of its residents
working_dir= "D:\\dataset\\"
setwd(working_dir)

file_output = "random_income.csv"
sink(file_output)

write.table(x,file=file_output,sep=",", col.names = F, row.names = F)
sink()
file.show(file_output) #show the file in directory

#plot histogram of x for each population
#par(mfrow=c(3,3), oma=c(0,0,0,0,0))
attach(mtcars)
par(mfrow=c(1,5)) 
for (j in 1:number_of_populations)
{
  #plot(X[,i],y,'xlab'=i)
  hist(x[,j],main="Normal",'xlab'=j)
}
Thuy Nguyen Hong
  • 243
  • 1
  • 4
  • 11
  • If you want a normal distribution, it can't be bounded as you describe. What do you want to happen to values that fall outside of [0,1]? – Thomas Nov 03 '13 at 22:22
  • Hi Thomas, I want valid data fall in [0,1] for my analysis. If the data doesn't meet the constraint, I can not use it at all. – Thuy Nguyen Hong Nov 03 '13 at 22:25
  • Maybe [this post](http://stackoverflow.com/questions/19343133/setting-upper-and-lower-limits-in-rnorm) helps – alexis_laz Nov 03 '13 at 22:31

3 Answers3

6

Here's a sensible simple way...

sampnorm01 <- function(n) qnorm(runif(n,min=pnorm(0),max=pnorm(1)))

Test it out:

mysamp <- sampnorm01(1e5)
hist(mysamp)

Thanks to @PatrickPerry, here is a generalized truncated normal, again using the inverse CDF method. It allows for different parameters on the normal and different truncation bounds.

rtnorm <- function(n, mean = 0, sd = 1, min = 0, max = 1) {
    bounds <- pnorm(c(min, max), mean, sd)
    u <- runif(n, bounds[1], bounds[2])
    qnorm(u, mean, sd)
}

Test it out:

mysamp <- rtnorm(1e5, .7, .2)
hist(mysamp)
Frank
  • 66,179
  • 8
  • 96
  • 180
  • @PatrickPerry Thanks for the edit! I changed it to keep both versions, hoping people can better see how yours works that way. I also wrote about it in prose instead of comments... just my stylistic preference. – Frank Oct 12 '17 at 21:07
4

You can normalize the data:

x = rnorm(100)

# normalize
min.x = min(x)
max.x = max(x)

x.norm = (x - min.x)/(max.x - min.x)
print(x.norm)
Fernando
  • 7,785
  • 6
  • 49
  • 81
  • Well yes, but since the data in `x` is normally distributed, as the sample size gets larger `min.x` and `max.x` head towards infinity. OP needs to define *how* they want their data constrained. – Marius Nov 03 '13 at 22:31
  • Hi Fernando, so COOL! Just very simple line of code, you make it. Thanks very much friend. You help me a lot!!! – Thuy Nguyen Hong Nov 03 '13 at 22:35
  • This changes the standard deviation of the data, from `sd` to `sd / (max.x - min.x)`. Are you sure you want that to happen? – musically_ut Nov 03 '13 at 22:56
  • Thanks musically_ut for pointing it out. As long as the data is in [0,1] it would be fine. Since I also testing data with different sd, and mean for diff scenarioes, the change in sd will not affect the validation conditions of my data. – Thuy Nguyen Hong Nov 03 '13 at 23:16
2

Here is my take on it.

The data is first normalized (at which stage the standard deviation is lost). After that, it is fitted to the range specified by the lower and upper parameters.

#' Creates a random normal distribution within the specified bounds
#' 
#' WARNING: This function does not preserve the standard deviation
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = 0, sd = 1, lower = -1, upper = 1){
    mean = ifelse(test = (is.na(mean)|| (mean < lower) || (mean > upper)),
                  yes = mean(c(lower, upper)),
                  no = mean)
    data <- rnorm(n, mean = mean, sd = sd) # data

    if (!is.na(lower) && !is.na(upper)){ # adjust data to specified range
        drange <- range(data)            # data range
        irange <- range(lower, upper)    # input range
        data <- (data - drange[1]) / (drange[2] - drange[1]) # normalize data (make it 0 to 1)
        data <- (data * (irange[2] - irange[1])) + irange[1] # adjust to specified range
    }
    return(data)
}

Example:

a <- rtnorm(n = 1000, lower = 10, upper = 90)
range(a)
plot(hist(a, 50))

enter image description here

Mehrad Mahmoudian
  • 3,466
  • 32
  • 36
Alex Essilfie
  • 12,339
  • 9
  • 70
  • 108