28

When generating random numbers in R using rnorm (or runif etc.), they seldom have the exact mean and SD as the distribution they are sampled from. Is there any simple one-or-two-liner that does this for me? As a preliminary solution, I've created this function but it seems like something that should be native to R or some package.

# Draw sample from normal distribution with guaranteed fixed mean and sd
rnorm_fixed = function(n, mu=0, sigma=1) {
  x = rnorm(n)  # from standard normal distribution
  x = sigma * x / sd(x)  # scale to desired SD
  x = x - mean(x) + mu  # center around desired mean
  return(x)
}

To illustrate:

x = rnorm(n=20, mean=5, sd=10)
mean(x)  # is e.g. 6.813...
sd(x)  # is e.g. 10.222...

x = rnorm_fixed(n=20, mean=5, sd=10)
mean(x)  # is 5
sd(x)  # is 10

The reason I want this is that I adjust my analysis on simulated data before applying it to real data. This is nice because with simulated data I know the exact properties (means, SDs etc.) and I avoid p-value inflation because I'm doing inferential statistics. I am asking if there exist anything simple like e.g.

rnorm(n=20, mean=5, sd=10, fixed=TRUE)
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Jonas Lindeløv
  • 5,442
  • 6
  • 31
  • 54
  • 2
    You can use the function `scale` to do this... but isn't this exactly illustrating the difference between sample and population statistics? As your `n` gets large `sd(x)` and `mean(x)` will approach the values you provided, but at only 20 samples you cannot expect perfect distribution... – Justin Sep 20 '13 at 14:25
  • 3
    Out of curiosity, why do you need that? I wouldn't expect a sample to have the same mean and sd as the population. – Roland Sep 20 '13 at 14:25
  • 2
    I think you've got it right. I think it's simple enough that people just do it like this when they need to. `MASS::mvrnorm` does have an analogous feature (but it's marginally trickier for the multivariate case, which is presumably why it's built in). Agree with @Justin that you could use `mu+sigma*scale(rnorm(n))` as a one-liner ... – Ben Bolker Sep 20 '13 at 14:26
  • 2
    Justin and Roland: I've added my motivation in the question :-) It's because I simulate data and want to know its properties! So yes, if I wanted this to represent the real world, these constraints would be strange. But I want a "perfect little world" to play around in, in order to know if I do things right :-) – Jonas Lindeløv Sep 20 '13 at 14:31
  • I usually just create a sample and calculate the properties. – Roland Sep 20 '13 at 14:35
  • I'm with the others: create a known sample, calculate the properties, and if you absolutely must, add/multiply by the inverse of the mean and sd to get the "round numbers" you desire. Just remember that there are lots and lots (that's the math term :-) ) of samples which will have a given sd and mean, but they can be radically different from each other. They wouldn't even have to be gaussian in nature. – Carl Witthoft Sep 20 '13 at 14:42
  • I guess it has to be a two-liner x=rnorm(n) and x=(x-mean(x))/sd(x) this will renormalise the random data. – Peter Dutton Sep 20 '13 at 14:44
  • It's just that when you do a mixed-model analysis on a 3x4 design with several random effects, then it's so much easier to check the coefficients and see that "ah, this coefficient is 3 as it should be, because that is the difference I specified between sample A and sample B" rather than "ah, this coefficient is 4.235 as it should be because that is the difference between sample A which had 22.586 and sample B which had 18.315". – Jonas Lindeløv Sep 20 '13 at 14:50
  • Dutton: right, and multiply by desired sd and add the desired mean, then you're there :-) Just made the function in case no one came up with a magical solution. Then at least others could code-reuse it. – Jonas Lindeløv Sep 20 '13 at 14:51

3 Answers3

43

Since you asked for a one-liner:

rnorm2 <- function(n,mean,sd) { mean+sd*scale(rnorm(n)) }
r <- rnorm2(100,4,1)
mean(r)  ## 4
sd(r)    ## 1
Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • From math'l point of view, is this erratic? One can desire the equivalence of the means of sample and the population from which it was drawn, but one should not desire the equivalences of the SDs of random numbers sample and its population. Central Limit Theorem: (X1+...+Xn)/n -> N(mean,StdDev.)=N(mu, sigma/sqrt(n)). Hence, to me, for math'l correctness, rnorm3 must be defined with SD=sigma/sqrt(n) (sigma: std dev. of the population) and m=mu. I wonder your considerations on this issue. So, if you also find rnorm3 necessary to be defined in accordance with CLT, how do you define it properly? – Erdogan CEVHER Aug 01 '19 at 20:10
  • 2
    I think, I found the answer of my question. Here, in the above code, since we are drawing **just one sample** from the population (instead of many samples from the population). Hence, it is normal that one may desire the equivalances of both means and SDs for the sample and the population. – Erdogan CEVHER Aug 01 '19 at 20:47
4

The mvrnorm() function in the MASS package can do this.

library(MASS)
#empirical=T forces mean and sd to be exact
x <- mvrnorm(n=20, mu=5, Sigma=10^2, empirical=T)
mean(x)
sd(x)
#empirical=F does not impose this constraint
x <- mvrnorm(n=20, mu=5, Sigma=10^2, empirical=F
mean(x)
sd(x)
pumphandle
  • 83
  • 5
3

This is an improvement of the function suggested in a previous answer so that it complies with the OP's need of having a "fixed" argument.

And still in one line ;-)

rnorm. <- function(n=10, mean=0, sd=1, fixed=TRUE) { switch(fixed+1, rnorm(n, mean, sd), as.numeric(mean+sd*scale(rnorm(n)))) }
rnorm.() %>% {c(mean(.), sd(.))}
#### [1] 0 1
rnorm.(,,,F) %>% {c(mean(.), sd(.))}
#### [1] 0.1871827 0.8124567

I chose to enter default values for every argument and add a as.numeric step to get rid of the attributes generated by the scale function.

agenis
  • 8,069
  • 5
  • 53
  • 102