3

I wish to simulate the central limit theorem in order to demonstrate it, and I am not sure how to do it in R. I want to create 10,000 samples with a sample size of n (can be numeric or a parameter), from a distribution I will choose (uniform, exponential, etc...). Then I want to graph in one plot (using the par and mfrow commands) the original distribution (histogram), the distribution of the means of all samples, a Q-Q plot of the means, and in the 4th graph (there are four, 2X2), I am not sure what to plot. Can you please assist me in starting to program it in R ? I think once I have the simulated data I should be fine. Thank you.

My initial attempt is below, it is too simple and I am not sure even correct.

r = 10000;
n = 20;

M = matrix(0,n,r);
Xbar = rep(0,r);

for (i in 1:r)
{
  M[,i] = runif(n,0,1);
}

for (i in 1:r)
{
  Xbar[i] = mean(M[,i]);
}

hist(Xbar);
user2899944
  • 249
  • 2
  • 11

2 Answers2

5

The CLT states that given i.i.d. samples from a distribution with mean and variance, the sample mean (as a random variable) has a distribution that converges to a Gaussian as the number of samples n increase. Here, I will assume that you want to generate r sample sets containing n samples each to create r samples of the sample mean. Some code to do that is as follows:

set.seed(123) ## set the seed for reproducibility
r <- 10000
n <- 200      ## I use 200 instead of 20 to enhance convergence to Gaussian

## this function computes the r samples of the sample mean from the 
## r*n original samples
sample.means <- function(samps, r, n) {
  rowMeans(matrix(samps,nrow=r,ncol=n))
}

For generating the plots, we use ggplot2 and Aaron's qqplot.data function from here. We also use gridExtra to plot multiple plots in one frame.

library(ggplot2)
library(gridExtra)
qqplot.data <- function (vec) {
  # following four lines from base R's qqline()
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y)/diff(x)
  int <- y[1L] - slope * x[1L]

  d <- data.frame(resids = vec)

  ggplot(d, aes(sample = resids)) + stat_qq() + geom_abline(slope = slope, intercept = int, colour="red") + ggtitle("Q-Q plot")  
}

generate.plots <- function(samps, samp.means) {
  p1 <- qplot(samps, geom="histogram", bins=30, main="Sample Histogram")
  p2 <- qplot(samp.means, geom="histogram", bins=30, main="Sample Mean Histogram")
  p3 <- qqplot.data(samp.means)
  grid.arrange(p1,p2,p3,ncol=2)
}

Then we can use these functions with the uniform distribution:

samps <- runif(r*n)  ## uniform distribution [0,1]
# compute sample means
samp.means <- sample.means(samps, r, n))
# generate plots
generate.plots(samps, samp.means)

We get:

Uniform samples

Or, with the poisson distribution with mean = 3:

samps <- rpois(r*n,lambda=3)
# compute sample means
samp.means <- sample.means(samps, r, n))
# generate plots
generate.plots(samps, samp.means)

We get:

Poisson samples

Or, with the exponential distribution with mean = 1/1:

samps <- rexp(r*n,rate=1)
# compute sample means
samp.means <- sample.means(samps, r, n))
# generate plots
generate.plots(samps, samp.means)

We get:

Exponential samples

Note that the mean of the sample mean histograms all look like Gaussians with mean that is very similar to the mean of the original generating distribution, whether this is uniform, poisson, or exponential, as predicted by the CLT (also its variance will be 1/(n=200) the variance of the original generating distribution).

Community
  • 1
  • 1
aichao
  • 7,375
  • 3
  • 16
  • 18
1

Maybe this can help you get started. I have hard-coded the normal distribution and only shown two of your suggested plots: a the histogram of a randomly selected sample, and a histogram of all sample means.

I guess my main suggestion is using a list to store the samples instead of a matrix.

r <- 10000
my.n <- 20

simulation <- list()

for (i in 1:r) {
  simulation[[i]] <- rnorm(my.n)
}

sample.means <- sapply(simulation, mean)

selected.sample <- runif(1, min = 1, max = r)

dev.off()
par(mfrow = c(1, 2))
hist(simulation[[selected.sample]])
hist(sample.means)
Mark Miller
  • 3,011
  • 1
  • 14
  • 34
  • Why are you using `rnorm` to generate samples? OP wants to generate i.i.d. samples from any distribution with known mean and variance to demonstrate CLT for that distribution. – aichao Oct 28 '16 at 16:01
  • OK, it's been a few years since I really thought about the central limit theorem. Can't the desired mean and sd just be passed to rnorm as additional arguments? Agreed, I haven't done anything to draw the samples from any distribution other than the normal. – Mark Miller Oct 28 '16 at 16:06
  • This blog post does a better job of the simulation: https://qualityandinnovation.com/2015/03/30/sampling-distributions-and-central-limit-theorem-in-r/ – Mark Miller Oct 28 '16 at 16:07
  • The CLT states that given i.i.d. samples with mean and variance, the sample mean (as a random variable) has a distribution that converges to a Gaussian as the number of samples `n` increase. The remarkable thing about the CLT is that there are no assumptions on the original distribution that generated the samples other than that it has mean and variance. The OP wants to simulate `r` number of sample sets containing `n` samples each. Each sample set is to compute a sample of the sample mean, and there are `r` of them to show that the resulting sample mean distribution approaches a Gaussian. – aichao Oct 28 '16 at 16:19