I'm creating a report on the statistical analysis of several distributions; more specifically, random populations and how their samples differ from them: the samples adhere to the properties of normal distributions, while the larger populations remain skewed in most cases.
Although I'm more than satisfied with the rest of the output, I can't figure out why certain numeric values and their visualisations differ from the ones produced at the command line. Here's some code that reproduces the discrepancy (first I generate 1000 random exponentials):
set.seed(1000)
pop <- rexp(1000, 0.2)
When I extract, say, the mean of pop, I get the correct result in the console, which is 4.76475. This is the value I should be getting in the markdown output, but instead knitr displays it as 5.015616.
mean(pop)
[1] 4.76475
```{r, echo = T}
mean(pop)
```
[1] 5.015616
It's not just the mean; almost all of the other required statistics for the population and the sample differ as well. In addition, I also get wrong visualisations in the knitted output:
The plots look wrong because they are built from the incorrect results. I thought this was a problem with the digits setting, but changing digits through options() isn't solving it, and neither is the default scipen = 0 setting. I've tried inserting inline code, but it still shows the incorrect values. I referred to knitr's manual to see if a chunk setting was missing but couldn't find a fault there. Is there something missing here, or is this a bug related to random distributions?
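To rule out the digits setting, here is a minimal sketch (the value is the console mean from above; x and shown are illustrative names, not from my report). It suggests that digits only changes how a number is printed, not the number itself, so it could not turn 4.76475 into 5.015616:

```r
x <- 4.76475

# Temporarily print with 3 significant digits
old <- options(digits = 3)
shown <- format(x)   # formatted using the current digits option
options(old)         # restore the previous setting

shown                    # the displayed text is rounded
identical(x, 4.76475)    # the stored value is unchanged
```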
EDIT: I noticed another peculiar property. I created a new markdown file to see if the results varied with each new output that I created. Let's call it test.Rmd; it contains the same commands that I've reproduced here, with the same seed. I'm now getting a totally different result, still different from the original value from the command-line session.
EDIT: Roman's point seems to be working. The knitted results are coming closer to the original values but still don't match exactly. Setting the seed to 357 gave me a mean(pop) of 4.881604, which is only about 0.12 away from the original value. But why is the seed the game changer here? I thought it had to be 1000.
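One way to check how the seed and the order of random draws interact is the sketch below, which can be run both in the console and in a knitted chunk. It is a minimal example, not code from my report; pop_a, pop_b and pop_c are illustrative names. The assumption being tested is that the same seed followed immediately by the same call reproduces the same numbers, while any extra random draw in between changes them:

```r
# Same seed, same call: the draws should be identical
set.seed(1000)
pop_a <- rexp(1000, 0.2)

set.seed(1000)
pop_b <- rexp(1000, 0.2)

# Same seed, but one extra random draw before rexp()
set.seed(1000)
invisible(runif(1))
pop_c <- rexp(1000, 0.2)

identical(pop_a, pop_b)   # expected to be TRUE
identical(pop_a, pop_c)   # expected to be FALSE

# The generator in use, worth comparing between console and knitr
RNGkind()
```

If the first comparison holds in both places but mean(pop_a) still differs between the console and the knitted document, comparing the RNGkind() output of the two sessions might be a useful next step.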
EDIT: Here's some of the code from the .Rmd file as requested by Phil.
# Load packages
library(ggplot2)
library(knitr)
library(gridExtra)
# Generate random exponentials
set.seed(357)
pop = rexp(1000, 0.2) # lambda is 0.2 with n = 1000
pop.table <- as.data.frame(pop)
# Take a sample simulating 1000 averages of 40 exponentials
sample.exp = NULL
for (i in 1:1000){
  # append the mean of 40 exponentials (n = 40 here)
  sample.exp = c(sample.exp, mean(rexp(40, 0.2)))
}
sample.df <- as.data.frame(sample.exp)
# Generate means and compare
mean(pop) # 4.881604
mean(sample.exp) # 4.992426
# Generate variances and compare
var(pop) # 26.07005
var(sample.exp) # 0.6562298
# Some plots
plot.means.pop <- ggplot(pop.table, aes(x = pop)) +
  geom_histogram(binwidth = 0.9, fill = 'white', colour = 'black') +
  # fixed colour set outside aes() so the line is drawn in red
  geom_vline(xintercept = mean(pop.table$pop), colour = 'red') +
  labs(title = 'Population Mean', x = 'Exponential', y = 'Frequency') +
  theme(legend.position = 'none') +
  theme(plot.title = element_text(hjust = 0.5))
plot.means.sample <- ggplot(sample.df, aes(x = sample.exp)) +
  geom_histogram(binwidth = 0.2, fill = 'white', colour = 'black') +
  geom_vline(xintercept = mean(sample.df$sample.exp), colour = 'red', size = 0.8) +
  labs(title = 'Sample Mean', x = 'Exponential', y = 'Frequency') +
  guides(fill = 'none') +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot.means.sample, plot.means.pop, ncol = 2, nrow = 1)
So that's pretty much the main portion of the file, and it gives me values that are close to, but not exactly, the results from the command line. Note: the annotated values are the new ones after setting the seed to 357, and I've set the same seed in the global environment. The values that I'm getting at the console are:
- 4.76475 for population mean
- 5.00238 for sample mean
- 21.80913 for population variance
- 0.6492991 for sample variance
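For comparison, the theoretical values for an exponential distribution with rate 0.2 can be computed directly. This is a minimal sketch; the variable names are illustrative and not from my report. Both the console values and the knitted values sit close to these, which would be consistent with both being valid random draws, just different ones:

```r
rate <- 0.2   # lambda used in rexp()
n <- 40       # size of each averaged sample

theo.mean <- 1 / rate               # mean of the exponential
theo.var <- 1 / rate^2              # variance of the exponential
theo.var.of.mean <- theo.var / n    # variance of the mean of n draws

c(mean = theo.mean, var = theo.var, var.of.mean = theo.var.of.mean)
```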