0

I'm creating a report on statistical analysis of several distributions; more specifically random populations and how their samples differ from them with the latter adhering to properties of normal distributions while their larger populations remain skewed in most cases.

Although I'm more than satisfied with the rest of the output, I'm unable to figure out why certain numeric values and their visualisations are differing from the ones done through the command line. Here's some of the reproduced code for the discrepancy(first I generate a 1000 random exponentials):

set.seed(1000)
pop <- rexp(1000, 0.2)

In extracting say, the mean of pop, I get the exact correct result through the console, which is 4.76475. This is the value I should be getting through the markdown output, but instead knitr displays it as 5.015616.

mean(pop)
[1] 4.76475

```{r, echo = T}
mean(pop)
```
[1] 5.015616

Its not just the mean, but in almost all of the rest of the required statistical variables for the population as well as sample. In addition, I also get wrong visualisations in the knitted output:

Original/correct plot

Knitted plot

The plots themselves are being displayed discrepant because of the incorrect results. I thought this is a problem with the digits setting, but digits(options) isn't really solving it, neither is default scipen = 0 setting. I've tried inserting inline code but its still showing me the incorrect values. Referred to knitR's manual if a chunk setting was missing but couldn't really find a fault there. Is there something missing here or a bug related to random distributions?

EDIT: I noticed another peculiar property. I created a new markdown file to see if the results varied according to each new output that I created. Let's name this as test.Rmd but it contains the same commands that I've reproduced here with the same seed. And I'm getting a totally different result now, still different from the original value from the command session.

EDIT: Roman's point seem to be working. Knitted result are coming closer to original values but are still not exactly matching. The seed set to 357 gave me a mean(pop) of 4.881604 which is only a decimal point away from the original value. But why is seed being the game changer here? I thought it has to be 1000.

EDIT: Here's some of the code from the .Rmd file as requested by Phil.

# Load packages
library(ggplot2)
library(knitr)
library(gridExtra)

# Generate random exponentials
set.seed(357)
pop = rexp(1000,0.2) # lambs is 0.2 with n = 1000
pop.table <- as.data.frame(pop)

# Take a sample simulating 1000 averages of 40 exponentials
sample.exp = NULL
for (i in 1:1000){
     sample.exp = c(sample, rexp(40, 0.2)} # n = 40 here
     sample.df <- as.data.frame(sample.exp)

# Generate means and compare
mean(pop) # 4.881604
mean(sample.exp) # 4.992426

# Generate variances and compare
var(pop) # 26.07005
var(sample.exp) # 0.6562298

# Some plots
plot.means.pop <- ggplot(pop.table, aes(pop.table$pop)) + geom_histogram(binwidth = 0.9, fill = 'white', colour = 'black') + geom_vline(aes(xintercept = mean(pop.table$pop), colour = 'red')) + labs(title =  'Population Mean', x = 'Exponential', y = 'Frequency') + theme(legend.position = 'none') +theme(plot.title = element_text(hjust = 0.5))

plot.means.sample <- ggplot(sample.df, aes(sample.df$sample.exp)) + geom_histogram(binwidth = 0.2, fill = 'white', colour = 'black') + geom_vline(aes(xintercept = mean(sample.df$sample.exp)), colour = 'red', size = 0.8) + labs(title = 'Sample Mean', x = 'Exponential', y = 'Frequency') + guides(fill = F) + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(plot.means.sample, plot.means.pop, ncol = 2, nrow = 1)

So thats pretty much the main portion of the file that is giving me 'close' values if not errors or the exact results from the command line. Note: The values annotated are new values after setting the seed to 357 and I've set the same for the global environment. The values that I'm receiving at console are:

  • 4.76475 for population mean
  • 5.00238 for sample mean
  • 21.80913 for population variance
  • 0.6492991 for sample variance
shiv_90
  • 1,025
  • 3
  • 12
  • 35
  • 1
    Could you post more code that exactly duplicates the problem you're running into? Right now, with the code that you've posted, I'm a little confused by the fact that you're assigning the random values to `pop` but then calling `mean(exp)`. You've explained the problem well, but it might help to see more of what you've done. – Sam Jan 07 '17 at 15:55
  • Yore importantly, is your interactive session using the same seed as your markdown script? – Benjamin Jan 07 '17 at 17:34
  • Try setting the seed (`set.seed(357)`) before you create random variables and see if they match. – Roman Luštrik Jan 07 '17 at 18:13
  • @Sam Sincere apologies for that typo, didn't realise it but its actually `mean(pop)`. Corrected. – shiv_90 Jan 08 '17 at 05:15
  • @Phil I believe the answer is no since I've been working on this problem for a couple of days days now and hence already restarted RStudio number of times. But I'll be keeping that in mind and see if the session restart is causing this. – shiv_90 Jan 08 '17 at 05:18
  • @Benjamin Yes the seed is exactly matching and set. – shiv_90 Jan 08 '17 at 05:20
  • 1
    I think without your knitr Rmd file to look at there's very little more we can do. Can you edit your question to share a minimal reproducible example of your problem? – Phil Jan 08 '17 at 12:10
  • @Phil Thats totally ok if there's no fix for now. At least, you guys tried to help and I thank everyone for that. Let me see Roman's solution although I don't quite understand how setting seed to 357 will fix this. – shiv_90 Jan 08 '17 at 15:24
  • 1
    @ShivendraSharma I cannot reproduce your problem: both knitr and a clean console give the same answer (5.015616). As your knitr document produces this too, I suspect the problem is something to do with your console. – Phil Jan 08 '17 at 16:23
  • @Phil Added some code from the .Rmd file. But what could be wrong with the console? I've never fidgeted with default settings and I have no idea why the console would show the wrong values. – shiv_90 Jan 09 '17 at 08:34

1 Answers1

1

When asking a question on Stack Overflow it's essential to provide a minimal reproducible example. In particular, have a good read of the first answer and this advice and this will guide you through the process.

I think we've all struggled to help you (and we want to!) because we can't reproduce your issue. Compare the following R and Rmd code when run or knitted, respectively:

# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambs is 0.2 with n = 1000
mean(pop)
## [1] 5.015616
var(pop)
## [1] 26.07005

and the Rmd:

---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    message = TRUE,
    warning = TRUE
)
```

```{r}
# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambs is 0.2 with n = 1000
mean(pop)
var(pop)
```

Which produces the following output:

# Generate random exponentials
set.seed(1000)
pop = rexp(1000, 0.2) # lambs is 0.2 with n = 1000
mean(pop)
## [1] 5.015616
var(pop)
## [1] 26.07005

As you can see, the result are identical from a clean R session and a clean knitr session. This is as expected, because the set.seed(), when set the same, should provide the same results every time (see the set.seed man page). When you change the seed to 357, the results vary together:

              | mean    | var      |
console (`R`) | 4.88... | 22.88... |
knitr (`Rmd`) | 4.88... | 22.88... |

In your second code block your knitr chunk result is correct for the 1000 seed, but the console result of 4.76 is incorrect, suggesting to me your console is producing the incorrect output. This could be for one of a few reasons:

  • You forgot to set the seed in the console before running the rexp() function. If you run this line without setting the seed the result will vary every time. Ensure you run the set.seed(1000) first or use an R script and source this to ensure steps are run through in order.
  • There's something in your global R environment that is affecting your results. This is less likely because you cleared your R environment, but this is one of the reasons it's important to create a new session from time to time, either by closing and re-opening RStudio or pressing CTRL + Shift + F10
  • There might be something set in your RProfile.site or .Rprofile that are setting an option on startup that's affecting your results. Have a look at Customizing startup to open and check your startup options, and if necessary correct them.

The output you're seeing isn't because of scipen because there are no numbers in scientific/engineering notation, and it's not digits because the differences you're seeing are more than differences in rounding.

If these suggestions still don't solve your issue, post the minimal reproducible example and try on other computers.

Community
  • 1
  • 1
Phil
  • 4,344
  • 2
  • 23
  • 33
  • Thanks for the prompt response Phil! I'll check these suggestions and report back. I'm doubting this is because of the wrong seed or there is something fishy with the console going on. And this is where the solution lies exactly, if your results match, mine should too. – shiv_90 Jan 09 '17 at 17:50
  • 1
    Silly I am; the `set.seed()` was the only thing missing here. Never knew that each new session has to be loaded in with a new seed every time. Many thanks! Can finally finish the report at peace =) – shiv_90 Jan 11 '17 at 08:12